Abstract: This project presents a phishing detection pipeline that combines email scraping, feature extraction, and machine learning with two external services, Seahound and Netcraft. It takes mailbox files containing mixed HTML and mail data as input and addresses the problem of identifying malicious content in email. The pipeline first extracts and cleans the data, then applies Natural Language Processing (NLP) to convert the textual content into features. Seahound and Netcraft extend this feature set: Seahound assesses URL legitimacy and reputation, while Netcraft supplies historical information on domain trustworthiness. A labeled dataset that separates legitimate emails from phishing attempts is used to train and evaluate machine learning models, notably the Random Forest classifier and the Support Vector Machine (SVM). The SVM model achieves high precision, recall, and F1-score. The project highlights how email scraping, NLP, feature extraction, and machine learning complement one another, and how external services can improve phishing detection accuracy, helping to protect users from email-based cyberattacks.
Keywords: Seahound, Netcraft, Natural Language Processing (NLP), Phishing Detection.
DOI: 10.17148/IARJSET.2024.11811
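
The extraction and cleansing stage described in the abstract could look roughly like the sketch below. It is an illustration only, not the authors' implementation: the abstract does not name its tooling, so the use of Python's standard `mailbox`, `email`, and `html.parser` modules is an assumption, and the names `message_text_and_urls` and `extract_mbox` are hypothetical. The Seahound and Netcraft lookups and the classifiers are left out because their interfaces are not described here.

```python
import mailbox
import re
from email.message import Message
from html.parser import HTMLParser

# Simple pattern for URLs that appear as bare text (an assumption, not exhaustive).
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


class _TextExtractor(HTMLParser):
    """Collects text nodes and <a href> targets from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Note: this also keeps the contents of <script>/<style>, if any.
        self.chunks.append(data)


def message_text_and_urls(msg: Message):
    """Return (text, urls) for one message, flattening any HTML parts."""
    texts, urls = [], []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and attachments
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            body = payload.decode(charset, errors="replace")
        except LookupError:  # unknown charset label in the header
            body = payload.decode("utf-8", errors="replace")
        if part.get_content_subtype() == "html":
            parser = _TextExtractor()
            parser.feed(body)
            parser.close()
            urls.extend(parser.links)
            body = " ".join(parser.chunks)
        texts.append(body)
    # Collapse whitespace so downstream NLP sees one clean string.
    text = re.sub(r"\s+", " ", " ".join(texts)).strip()
    urls.extend(URL_RE.findall(text))
    # Drop duplicate URLs while preserving first-seen order.
    return text, list(dict.fromkeys(urls))


def extract_mbox(path: str):
    """Read an mbox file and return one record per message."""
    box = mailbox.mbox(path)
    try:
        records = []
        for msg in box:
            text, urls = message_text_and_urls(msg)
            records.append(
                {"subject": str(msg.get("Subject", "")), "text": text, "urls": urls}
            )
        return records
    finally:
        box.close()
```

Each record's `text` field would then feed the NLP feature extraction step, and the `urls` list is the natural input for URL and domain reputation checks.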