Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach
Ion Androutsopoulos, Georgios Paliouras, Vangelis Karkaletsis, Georgios Sakkis, Constantine D. Spyropoulos, Panagiotis Stamatopoulos

TL;DR
This paper compares Naive Bayesian and memory-based machine learning algorithms for spam email filtering, demonstrating that both outperform traditional keyword-based methods on a standard dataset.
Contribution
It provides a thorough performance comparison of Naive Bayesian and memory-based approaches for spam filtering using a standard benchmark dataset.
Findings
Both methods achieve high accuracy in spam detection.
They outperform traditional keyword-based filters.
The Naive Bayesian filter performs comparably to the memory-based filter.
Abstract
We investigate the performance of two machine learning algorithms in the context of anti-spam filtering. The increasing volume of unsolicited bulk e-mail (spam) has generated a need for reliable anti-spam filters. Filters of this type have so far been based mostly on keyword patterns that are constructed by hand and perform poorly. The Naive Bayesian classifier has recently been suggested as an effective method to construct automatically anti-spam filters with superior performance. We investigate thoroughly the performance of the Naive Bayesian filter on a publicly available corpus, contributing towards standard benchmarks. At the same time, we compare the performance of the Naive Bayesian filter to an alternative memory-based learning approach, after introducing suitable cost-sensitive evaluation measures. Both methods achieve very accurate spam filtering, outperforming clearly the…
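The abstract names a Naive Bayesian filter and cost-sensitive evaluation but gives no details here, so the following is only a minimal sketch of how such a filter might look, not the authors' implementation. It assumes binary word-presence attributes, Laplace smoothing, and a hypothetical `cost_ratio` parameter expressing how many times worse it is to block a legitimate message than to let a spam message through. The function names, the smoothing choice, and the threshold rule `cost_ratio / (1 + cost_ratio)` are all assumptions made for illustration.

```python
import math
from collections import Counter

LABELS = ("spam", "legit")

def train_naive_bayes(messages):
    """messages: list of (tokens, label) pairs, label in LABELS.
    Assumes at least one training message per class."""
    doc_counts = Counter()
    token_counts = {label: Counter() for label in LABELS}
    vocab = set()
    for tokens, label in messages:
        present = set(tokens)          # binary attributes: word present or not
        doc_counts[label] += 1
        token_counts[label].update(present)
        vocab |= present
    return doc_counts, token_counts, vocab

def spam_probability(model, tokens):
    """Posterior P(spam | message) under the naive independence assumption."""
    doc_counts, token_counts, vocab = model
    total = sum(doc_counts.values())
    present = set(tokens) & vocab      # words unseen in training are ignored
    log_score = {}
    for label in LABELS:
        score = math.log(doc_counts[label] / total)   # class prior
        for tok in vocab:
            # Laplace-smoothed estimate of P(tok present | label)
            p = (token_counts[label][tok] + 1) / (doc_counts[label] + 2)
            score += math.log(p if tok in present else 1.0 - p)
        log_score[label] = score
    # Normalize in log space to avoid underflow on large vocabularies.
    top = max(log_score.values())
    weights = {k: math.exp(v - top) for k, v in log_score.items()}
    return weights["spam"] / (weights["spam"] + weights["legit"])

def is_spam(model, tokens, cost_ratio=1.0):
    """Flag as spam only if the posterior exceeds a cost-dependent threshold.
    cost_ratio is a hypothetical parameter: higher values make the filter
    more reluctant to block a message."""
    threshold = cost_ratio / (1.0 + cost_ratio)
    return spam_probability(model, tokens) > threshold
```

With a toy training set, raising `cost_ratio` makes the same message harder to classify as spam, which is the kind of trade-off a cost-sensitive evaluation is meant to capture.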
Taxonomy
Topics: Spam and Phishing Detection · Topic Modeling · Text and Document Classification Technologies
