Detecting Phishing Emails Using Random Forest and AdaBoost Classifier Model

Nthurima, Fredrick; Mutua, Abraham; Stephen Titus, Waithaka

Center for Open Access in Science (COAS)
OPEN JOURNAL FOR INFORMATION TECHNOLOGY (OJIT)
ISSN (Online) 2620-0627 * ojit@centerprode.com

2023 - Volume 6 - Number 2

Fredrick Nthurima
Kenyatta University, School of Pure and Applied Science, Nairobi, KENYA

Abraham Mutua
Kenyatta University, School of Pure and Applied Science, Nairobi, KENYA

Waithaka Stephen Titus * ORCID: 0000-0003-2113-3382
Kenyatta University, School of Pure and Applied Science, Nairobi, KENYA

Open Journal for Information Technology, 2023, 6(2), 123-136 * https://doi.org/10.32591/coas.ojit.0602.03123n
Received: 7 July 2023 ▪ Revised: 4 October 2023 ▪ Accepted: 15 November 2023

LICENCE: Creative Commons Attribution 4.0 International License.

ABSTRACT:
A phishing attack occurs when a phishing email, a legitimate-looking message designed to convince the recipient that it is genuine, lures the recipient into opening it and clicking the malicious links embedded in it. The user is thereby led to reveal sensitive information such as credit card numbers, usernames or passwords to the attacker, who uses it to gain entry into the compromised account. Online surveys rank phishing as the leading attack on web content, mostly targeting financial institutions. According to a survey conducted by Ponemon Institute LLC (2017), the loss due to phishing attacks is about $1.5 billion per year. Phishing is a global threat to information security and is on the rise due to the Internet of Things (IoT); a better phishing detection mechanism is therefore required to mitigate these losses and the associated reputational injury. This research paper explores and reports the use of a combination of two machine learning algorithms, Random Forest and AdaBoost, together with a larger set of phishing email features, to improve the accuracy of phishing detection and prevention. The project will explore existing phishing methods, investigate the effect of combining two machine learning algorithms to detect and prevent phishing attacks, design and develop a supervised classifier that can detect and prevent phishing emails, and test the model with existing data. A dataset consisting of both benign and phishing emails will be used to train the model by supervised learning. The expected accuracy is 99.9%, with False Negative (FN) and False Positive (FP) rates of 0.1% and below.
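
The accuracy, FP rate and FN rate targets stated above can be made concrete with a small calculation. The following is a minimal sketch, not the authors' evaluation code: the function name `detection_rates`, the label encoding (1 for phishing, 0 for benign) and the sample labels are all hypothetical, and the rates are assumed to follow the common definitions FP rate = FP / (FP + TN) and FN rate = FN / (FN + TP).

```python
def detection_rates(y_true, y_pred, positive=1):
    """Compute accuracy, false-positive rate and false-negative rate.

    Assumption: `positive` marks the phishing class; every other label
    is treated as benign. This encoding is illustrative only.
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # phishing email correctly flagged
            else:
                fp += 1  # benign email wrongly flagged
        else:
            if truth == positive:
                fn += 1  # phishing email missed
            else:
                tn += 1  # benign email correctly passed
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "fp_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "fn_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Hypothetical labels: 4 phishing emails and 6 benign emails.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
rates = detection_rates(y_true, y_pred)
print(rates)
```

Under these definitions, the stated targets correspond to `accuracy` of at least 0.999 and both `fp_rate` and `fn_rate` of at most 0.001 on the test data.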

KEY WORDS: classification, algorithm, cyber security, machine learning, spam emails, cyberattack, web attacks, intrusion detection, phishing emails, AdaBoost, Random Forest.

CORRESPONDING AUTHOR:
Fredrick Nthurima, Kenyatta University, School of Pure and Applied Science, Nairobi, KENYA.
