"In vivo" spam filtering: A challenge problem for data mining

Tom Fawcett

arXiv:cs/0405007 · cs.AI · May 23, 2007 · 48 cites

Open Access

TL;DR

This paper argues that real-world ("in vivo") spam filtering has characteristics that the usual static text classification framing misses, and that it offers data mining researchers an accessible domain for studying those characteristics.

Contribution

It identifies key characteristics of in vivo spam filtering and argues for its importance as a challenging domain for data mining research.

Findings

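01. Real-world datasets that exhibit these characteristics are typically difficult to acquire and share; spam filtering is an accessible exception.

02. In vivo spam filtering has characteristics that are not captured when it is treated as a static text classification problem.

03. Researchers should pursue in vivo spam filtering as an accessible domain for investigating these characteristics.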

Abstract

Spam, also known as Unsolicited Commercial Email (UCE), is the bane of email communication. Many data mining researchers have addressed the problem of detecting spam, generally by treating it as a static text classification problem. True in vivo spam filtering has characteristics that make it a rich and challenging domain for data mining. Indeed, real-world datasets with these characteristics are typically difficult to acquire and to share. This paper demonstrates some of these characteristics and argues that researchers should pursue in vivo spam filtering as an accessible domain for investigating them.

Taxonomy

Topics: Spam and Phishing Detection · Imbalanced Data Classification Techniques · Advanced Malware Detection Techniques