An evaluation of Naive Bayesian anti-spam filtering
Ion Androutsopoulos, John Koutsias, Konstantinos V. Chandrinos, George Paliouras, Constantine D. Spyropoulos

TL;DR
This paper thoroughly evaluates the effectiveness of Naive Bayesian classifiers for spam filtering, analyzing various factors affecting performance and highlighting the need for additional safety measures for practical deployment.
Contribution
It provides a thorough evaluation on a publicly available corpus and examines how attribute-set size, training-corpus size, lemmatization, and stop-lists affect filter performance, issues that had not been previously explored.
Findings
The Naive Bayesian filter needs additional safety nets to be viable in practice.
Attribute-set size and training-corpus size affect performance.
Lemmatization and stop-lists influence spam filter effectiveness.
Abstract
It has recently been argued that a Naive Bayesian classifier can be used to filter unsolicited bulk e-mail ("spam"). We conduct a thorough evaluation of this proposal on a corpus that we make publicly available, contributing towards standard benchmarks. At the same time we investigate the effect of attribute-set size, training-corpus size, lemmatization, and stop-lists on the filter's performance, issues that had not been previously explored. After introducing appropriate cost-sensitive evaluation measures, we reach the conclusion that additional safety nets are needed for the Naive Bayesian anti-spam filter to be viable in practice.
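To make the abstract's "cost-sensitive" idea concrete, here is a minimal Python sketch of a Naive Bayesian filter with a cost-weighted decision rule. It is an illustration under stated assumptions, not the authors' implementation: binary word-presence attributes, Laplace smoothing (`alpha`), whitespace tokenization, and the cost ratio `lam` are all choices made for this example. The rule assumes that blocking a legitimate message is `lam` times as costly as letting a spam message through, so a message is flagged only when P(spam | message) exceeds `lam / (1 + lam)`.

```python
import math
from collections import Counter

CLASSES = ("spam", "legit")

def train(messages, labels, vocabulary, alpha=1.0):
    """Estimate class priors and P(word present | class) with Laplace smoothing.

    Assumes both classes occur at least once in `labels`.
    """
    counts = Counter(labels)
    present = {c: Counter() for c in CLASSES}
    for text, label in zip(messages, labels):
        tokens = set(text.lower().split())
        for word in vocabulary:
            if word in tokens:
                present[label][word] += 1
    prior = {c: counts[c] / len(labels) for c in CLASSES}
    cond = {
        c: {w: (present[c][w] + alpha) / (counts[c] + 2 * alpha) for w in vocabulary}
        for c in CLASSES
    }
    return prior, cond

def spam_probability(text, prior, cond):
    """Return P(spam | text) under the naive independence assumption."""
    tokens = set(text.lower().split())
    log_score = {}
    for c in CLASSES:
        score = math.log(prior[c])
        for word, p in cond[c].items():
            # Each attribute contributes whether the word is present or absent.
            score += math.log(p if word in tokens else 1.0 - p)
        log_score[c] = score
    # Normalize in log space to avoid underflow with many attributes.
    top = max(log_score.values())
    weights = {c: math.exp(s - top) for c, s in log_score.items()}
    return weights["spam"] / (weights["spam"] + weights["legit"])

def is_spam(text, prior, cond, lam=9.0):
    """Flag as spam only if P(spam | text) exceeds lam / (1 + lam)."""
    return spam_probability(text, prior, cond) > lam / (1.0 + lam)
```

With `lam = 1` the rule reduces to picking the more probable class. As `lam` grows, the threshold approaches 1 and fewer messages are blocked, which illustrates why a filter may need additional safety nets when false positives are treated as very costly.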
Taxonomy
Topics: Spam and Phishing Detection · Text and Document Classification Technologies · Sentiment Analysis and Opinion Mining
