An evaluation of Naive Bayesian anti-spam filtering

Ion Androutsopoulos; John Koutsias; Konstantinos V. Chandrinos; George Paliouras; Constantine D. Spyropoulos

arXiv:cs/0006013 · cs.CL · September 25, 2009 · 528 cites

Open Access

TL;DR

This paper thoroughly evaluates the effectiveness of Naive Bayesian classifiers for spam filtering, analyzing various factors affecting performance and highlighting the need for additional safety measures for practical deployment.

Contribution

It provides a comprehensive evaluation on a corpus that the authors make publicly available, and examines the effect of attribute-set size, training-corpus size, lemmatization, and stop-lists on filter performance, issues that had not been previously explored.

Findings

01 The Naive Bayesian filter needs additional safety nets to be viable in practice.

02 Attribute-set size and training-corpus size significantly affect filter performance.

03 Lemmatization and stop-lists influence the filter's effectiveness.

Abstract

It has recently been argued that a Naive Bayesian classifier can be used to filter unsolicited bulk e-mail ("spam"). We conduct a thorough evaluation of this proposal on a corpus that we make publicly available, contributing towards standard benchmarks. At the same time we investigate the effect of attribute-set size, training-corpus size, lemmatization, and stop-lists on the filter's performance, issues that had not been previously explored. After introducing appropriate cost-sensitive evaluation measures, we reach the conclusion that additional safety nets are needed for the Naive Bayesian anti-spam filter to be viable in practice.
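To make the idea in the abstract concrete, here is a minimal sketch of a Naive Bayesian spam filter with a cost-sensitive decision rule. It is an illustration under stated assumptions, not the authors' implementation: the binary word-presence attributes, the Laplace smoothing, the function names (`train`, `spam_probability`, `is_spam`), and the parameter `lam` are all hypothetical choices made for this example. The rule assumed here treats blocking a legitimate message as `lam` times more costly than letting a spam message through, so a message is flagged only when its spam probability exceeds `lam / (1 + lam)`.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case whitespace tokenization; returns the set of distinct words."""
    return set(text.lower().split())


def train(messages):
    """Fit a Bernoulli Naive Bayes model.

    messages: list of (text, is_spam) pairs.
    Returns a dict with per-class document counts and word document frequencies.
    """
    n_spam = sum(1 for _, is_spam in messages if is_spam)
    n_legit = len(messages) - n_spam
    spam_df, legit_df = Counter(), Counter()
    for text, is_spam in messages:
        (spam_df if is_spam else legit_df).update(tokenize(text))
    return {
        "n_spam": n_spam,
        "n_legit": n_legit,
        "spam_df": spam_df,
        "legit_df": legit_df,
        "vocab": set(spam_df) | set(legit_df),
    }


def spam_probability(model, text):
    """Return P(spam | message) under the naive independence assumption."""
    words = tokenize(text)
    n_spam, n_legit = model["n_spam"], model["n_legit"]
    total = n_spam + n_legit
    # Laplace-smoothed class priors (an assumption of this sketch).
    log_spam = math.log((n_spam + 1) / (total + 2))
    log_legit = math.log((n_legit + 1) / (total + 2))
    for w in model["vocab"]:
        # Laplace-smoothed probability that word w appears in a message of each class.
        p_spam = (model["spam_df"][w] + 1) / (n_spam + 2)
        p_legit = (model["legit_df"][w] + 1) / (n_legit + 2)
        if w in words:
            log_spam += math.log(p_spam)
            log_legit += math.log(p_legit)
        else:
            log_spam += math.log(1 - p_spam)
            log_legit += math.log(1 - p_legit)
    # Normalize in log space to avoid underflow on large vocabularies.
    m = max(log_spam, log_legit)
    e_spam, e_legit = math.exp(log_spam - m), math.exp(log_legit - m)
    return e_spam / (e_spam + e_legit)


def is_spam(model, text, lam=9.0):
    """Cost-sensitive decision: flag only if P(spam | message) > lam / (1 + lam)."""
    threshold = lam / (1.0 + lam)
    return spam_probability(model, text) > threshold


if __name__ == "__main__":
    # Toy training data, invented for illustration only.
    data = [
        ("win free money now", True),
        ("free prize claim now", True),
        ("meeting agenda for monday", False),
        ("project report attached", False),
    ]
    model = train(data)
    for lam in (1.0, 999.0):
        print(lam, is_spam(model, "free money", lam=lam))
```

With `lam = 1` the filter behaves like an ordinary classifier, while a very large `lam` flags a message only when the model is almost certain. That trade-off is one way to see why the abstract concludes that additional safety nets are needed: at strict thresholds few messages are blocked, and at lenient ones legitimate mail is at risk.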

Taxonomy

Topics: Spam and Phishing Detection · Text and Document Classification Technologies · Sentiment Analysis and Opinion Mining