Abstract: Email spam is one of the biggest threats on today’s Internet. Most spam emails are commercial, but many contain phishing links that lead to malware, and such messages can also cause the loss of private data. Many anti-spam filters have been developed to deal with this threat, and one major challenge for these filters is predicting the labels of emails in a personalized mailbox. Researchers have used stylistic features of text messages to classify them as ham or spam, and spam detection can be strongly influenced by the presence of known words, phrases, abbreviations, and idioms. This creates the need for a prudent mechanism to detect such spam emails so that the system's time and memory can be saved to a great extent. This paper compares different classification techniques, all of them traditional machine learning techniques, on datasets collected from previous research works, and evaluates them on the basis of their accuracy, recall, and precision. We also present an NLP mechanism that separates spam from non-spam emails and further categorizes spam emails into different types. The proposed algorithm generates a dictionary and features and uses them to train machine learning models.
Keywords: Naive Bayes, Support Vector Machine, Natural Language Processing, analysis.
DOI: 10.17148/IARJSET.2021.8632
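
To make the pipeline summarized in the abstract concrete (dictionary generation, word-count features, a Naive Bayes classifier, and scoring by accuracy, precision, and recall), the following is a minimal sketch in Python. The toy messages, the function names (tokenize, train_naive_bayes, predict, evaluate), and the use of add-one smoothing are assumptions made for illustration only; they are not taken from the paper's datasets or implementation.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace; a real system would also handle punctuation.
    return text.lower().split()

def train_naive_bayes(messages, labels):
    # Build the dictionary (vocabulary) and the per-class word counts.
    vocab = set()
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter(labels)
    for text, label in zip(messages, labels):
        tokens = tokenize(text)
        vocab.update(tokens)
        word_counts[label].update(tokens)
    return vocab, word_counts, class_counts

def predict(text, vocab, word_counts, class_counts):
    total_docs = sum(class_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        # Log prior plus log likelihoods with add-one (Laplace) smoothing.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok in vocab:  # words outside the dictionary are ignored
                score += math.log(
                    (word_counts[label][tok] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)

def evaluate(y_true, y_pred, positive="spam"):
    # Accuracy, precision, and recall with "spam" as the positive class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical toy data, for illustration only.
train_msgs = [
    "win free prize now",
    "free money click now",
    "meeting at noon tomorrow",
    "lunch with the team tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]
test_msgs = ["free prize click", "team meeting tomorrow"]
test_labels = ["spam", "ham"]

model = train_naive_bayes(train_msgs, train_labels)
predictions = [predict(m, *model) for m in test_msgs]
accuracy, precision, recall = evaluate(test_labels, predictions)
print(predictions, accuracy, precision, recall)
```

The sketch is written in pure Python so that each step stays visible. In practice a library implementation of Naive Bayes or a Support Vector Machine would likely replace the hand-written classifier, and the metrics would be computed on a held-out portion of a real dataset.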