ML and DL-based Phishing Website Detection: The Effects of Varied Size Datasets and Informative Feature Selection Techniques

Kibreab Adane; Berhanu Beyene; Mohammed Abebe

doi:10.37965/jait.2023.0269

Authors

Kibreab Adane Faculty of Computing & Software Engineering, Arba Minch University, Ethiopia https://orcid.org/0000-0002-3021-5059
Berhanu Beyene Ethiopian Cybersecurity Association, Addis Ababa, Ethiopia
Mohammed Abebe Faculty of Computing & Software Engineering, Arba Minch University, Ethiopia

DOI:

https://doi.org/10.37965/jait.2023.0269

Keywords:

phishing website detection, machine learning, deep learning, feature selection technique, phishing website datasets, ANOVA-F-test, mutual information

Abstract

Using the Internet for communication, collaboration, and other productive activities requires interacting with specific webpages or websites. However, because phishing websites look benign and not all visitors have the knowledge and skills to assess the trustworthiness of the websites they visit, users are tricked into disclosing sensitive information and become vulnerable to malicious software attacks such as ransomware. One of the core challenges in combating phishing websites is that attackers cannot be stopped from creating them. The threat can nevertheless be alleviated by detecting that a given website is phishing and alerting online users so that they can take the necessary precautions before handing over sensitive information. In this study, five machine learning (ML) and deep learning (DL) algorithms, namely CatBoost (CATB), gradient boosting (GB), random forest (RF), multilayer perceptron (MLP), and deep neural network (DNN), were tested with three different reputable datasets and two informative feature selection techniques to assess the scalability and consistency of each classifier’s performance on datasets of varied sizes. The experimental findings reveal that the CATB classifier achieved the best accuracy on all three datasets (DS-1, DS-2, and DS-3), with respective values of 97.9%, 95.73%, and 98.83%. The GB classifier achieved the second-best accuracy on DS-1, DS-2, and DS-3, with respective values of 97.16%, 95.18%, and 98.58%. MLP had the shortest computational time on DS-1, DS-2, and DS-3 (2, 7, and 3 seconds, respectively), despite scoring the lowest accuracy on all three datasets.
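The abstract does not describe how the feature selection techniques were implemented. As a rough illustration of one of the techniques named in the keywords, the ANOVA F-test, the sketch below scores each feature by the ratio of between-class to within-class variance and keeps the highest-scoring ones. The feature names (`url_length`, `num_dots`, `has_https`), the toy rows, and the helper functions `anova_f_score` and `select_top_k` are all hypothetical and are not taken from the study or its datasets.

```python
from statistics import mean


def anova_f_score(values, labels):
    """One-way ANOVA F statistic for a single feature across class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = mean(values)
    # Between-class sum of squares: distance of each class mean from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Within-class sum of squares: spread of samples around their own class mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_top_k(rows, labels, feature_names, k):
    """Rank features by F score and return the k highest plus all scores."""
    scores = {
        name: anova_f_score([row[i] for row in rows], labels)
        for i, name in enumerate(feature_names)
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Toy data invented for illustration only: 1 = phishing, 0 = legitimate.
feature_names = ["url_length", "num_dots", "has_https"]  # hypothetical features
rows = [
    [75, 4, 0],
    [82, 5, 1],
    [90, 3, 0],
    [30, 2, 1],
    [25, 4, 1],
    [35, 3, 0],
]
labels = [1, 1, 1, 0, 0, 0]

top_features, scores = select_top_k(rows, labels, feature_names, k=2)
print(top_features)
```

In this toy example `url_length` separates the two classes far more cleanly than the other columns, so it receives the largest F score. A higher F score only indicates that class means differ relative to the within-class spread; it does not capture non-linear dependence, which is one reason a second criterion such as mutual information may be used alongside it.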