Abstract: The exponential growth of internet services and digital transactions has significantly increased exposure to cyber threats, particularly phishing attacks, malware dissemination, and fraudulent web activities. Malicious URLs serve as a primary attack vector in these cyber incidents, enabling adversaries to manipulate users, extract confidential information, and compromise enterprise security infrastructures. Conventional detection mechanisms, primarily based on static blacklists and signature matching, are inadequate in identifying newly generated or zero-day malicious URLs, thereby necessitating intelligent and adaptive detection strategies.

To address these limitations, this work proposes a malicious URL detection framework built on data analytics and machine learning. The system uses feature engineering to extract lexical, statistical, and structural characteristics from URLs, including entropy, character frequency distribution, URL length, special character density, domain-related indicators, and hierarchical path depth. The extracted features are standardized through structured preprocessing pipelines so that the model sees consistent inputs during training and inference.
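The lexical and structural features listed above can be illustrated with a minimal sketch using only the Python standard library. The function names, the `SPECIAL_CHARS` set, and the naive subdomain count are assumptions made for illustration; the abstract does not specify how the authors compute each feature.

```python
import ipaddress
import math
from collections import Counter
from urllib.parse import urlparse

# Assumed set of "special" characters; the paper does not list its own.
SPECIAL_CHARS = set("-_@?=&%.~+#")


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def host_is_ip(host: str) -> bool:
    """True when the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_features(url: str) -> dict:
    """Map a URL string to a flat dictionary of numeric features."""
    parsed = urlparse(url)
    # hostname is None when the URL has no scheme, so fall back to "".
    host = parsed.hostname or ""
    path_segments = [seg for seg in parsed.path.split("/") if seg]
    length = len(url)
    return {
        "url_length": length,
        "entropy": shannon_entropy(url),
        "special_char_density": (
            sum(ch in SPECIAL_CHARS for ch in url) / length if length else 0.0
        ),
        "digit_ratio": sum(ch.isdigit() for ch in url) / length if length else 0.0,
        # Naive count: assumes the last two labels form the registered domain.
        "subdomain_count": max(host.count(".") - 1, 0),
        "host_is_ip": int(host_is_ip(host)),
        "path_depth": len(path_segments),
    }
```

Calling `extract_features("http://a.b.example.com/x/y/z")` returns a dictionary whose `path_depth` is 3 and whose `subdomain_count` is 2. A production system would likely replace the naive subdomain count with a lookup against the Public Suffix List.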

The classification core of the system is implemented using the XGBoost algorithm, selected for its robustness, high predictive performance, and capability to model complex nonlinear relationships. To further enhance reliability, a heuristic-based red flag detection layer is integrated alongside the machine learning model, forming a hybrid detection architecture. This layered approach improves resilience against obfuscation techniques and stealthy phishing strategies that attempt to evade automated detection.
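One way to picture the hybrid architecture is a rule layer whose findings adjust the probability returned by the classifier. The sketch below is an assumption-laden illustration: the specific rules, the keyword list, the `flag_weight` parameter, and the additive combination are all hypothetical, since the abstract does not describe which red flags are checked or how they are merged with the XGBoost output.

```python
import re
from typing import List
from urllib.parse import urlparse

# Hypothetical keyword list; not taken from the paper.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "update", "account")


def red_flags(url: str) -> List[str]:
    """Return the names of the heuristic rules that the URL triggers."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    flags = []
    if re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host):
        flags.append("ip_address_host")        # raw IPv4 address instead of a domain
    if "@" in parsed.netloc:
        flags.append("userinfo_in_authority")  # e.g. http://bank.com@evil.example/
    if host.startswith("xn--") or ".xn--" in host:
        flags.append("punycode_host")          # possible homograph domain
    if host.count(".") >= 4:
        flags.append("many_subdomains")
    if any(keyword in url.lower() for keyword in SUSPICIOUS_KEYWORDS):
        flags.append("suspicious_keyword")
    return flags


def hybrid_score(model_probability: float, flags: List[str],
                 flag_weight: float = 0.15) -> float:
    """Add a fixed weight per triggered flag to the model probability, capped at 1.0."""
    return min(1.0, model_probability + flag_weight * len(flags))
```

Here `model_probability` stands for the malicious-class probability produced by the trained classifier. Because the rules only ever raise the score, an obfuscated URL that the model rates as borderline can still be escalated when it trips several flags.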

The backend is developed in Python and uses Scikit-Learn pipelines for preprocessing and model integration. An interactive dashboard supports real-time URL analysis, risk scoring, and visualization of feature contributions. Experimental evaluation indicates high classification accuracy, strong generalization to unseen datasets, and a reduced false positive rate.
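A Scikit-Learn pipeline that chains standardization with a gradient-boosted classifier might look like the sketch below. Two points are assumptions rather than the authors' setup: scikit-learn's `GradientBoostingClassifier` is used as a stand-in so that the snippet depends on scikit-learn alone (`xgboost.XGBClassifier` follows the same estimator interface and could take its place), and the feature rows are synthetic toy values invented purely to make the example runnable.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic toy rows: [url_length, entropy, special_char_density, path_depth].
# These are illustrative values, not data from the paper.
X = np.array([
    [19, 3.2, 0.10, 0],
    [24, 3.5, 0.12, 1],
    [30, 3.6, 0.11, 2],
    [22, 3.3, 0.09, 1],
    [95, 4.9, 0.28, 6],
    [120, 5.1, 0.31, 7],
    [88, 4.7, 0.25, 5],
    [140, 5.3, 0.35, 8],
])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = benign, 1 = malicious

# Scaling and classification are fitted together, so inference reuses
# exactly the statistics learned during training.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", GradientBoostingClassifier(n_estimators=20, random_state=0)),
])
pipeline.fit(X, y)


def risk_score(features) -> float:
    """Return the predicted probability of the malicious class for one feature row."""
    row = np.asarray([features], dtype=float)
    return float(pipeline.predict_proba(row)[0, 1])
```

A dashboard backend could call `risk_score` on the feature vector of a submitted URL and display the result as the risk value. Keeping the scaler inside the pipeline avoids the common mistake of scaling training and inference data with different statistics.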

Overall, the proposed system delivers a scalable, efficient, and explainable malicious URL detection solution, contributing to strengthened cybersecurity defenses in modern web environments.

Keywords: Malicious URL Detection, Machine Learning, XGBoost, Data Analytics, Cybersecurity, Phishing Detection, Feature Engineering, Entropy Analysis, Scikit-Learn, SHAP.


DOI: 10.17148/IARJSET.2026.13351

How to Cite:

[1] Vignesh S, Dr. K. Santhi, "DETECTING MALICIOUS URLs USING DATA ANALYTICS AND MACHINE LEARNING," International Advanced Research Journal in Science, Engineering and Technology (IARJSET), DOI: 10.17148/IARJSET.2026.13351
