"In vivo" spam filtering: A challenge problem for data mining
Tom Fawcett

TL;DR
This paper argues that real-world (in vivo) spam filtering is harder than the static text classification problem it is usually treated as, and proposes it as a challenge domain for data mining research.
Contribution
It identifies key characteristics of in vivo spam filtering and argues for its importance as a challenging domain for data mining research.
Findings
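Real-world datasets with the characteristics of in vivo spam filtering are typically difficult to acquire and to share, which makes spam filtering valuable as an accessible domain for studying them.
In vivo spam filtering presents challenges that are not captured when the problem is treated as static text classification.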
The paper advocates for focusing research on in vivo spam filtering as a rich domain.
Abstract
Spam, also known as Unsolicited Commercial Email (UCE), is the bane of email communication. Many data mining researchers have addressed the problem of detecting spam, generally by treating it as a static text classification problem. True in vivo spam filtering has characteristics that make it a rich and challenging domain for data mining. Indeed, real-world datasets with these characteristics are typically difficult to acquire and to share. This paper demonstrates some of these characteristics and argues that researchers should pursue in vivo spam filtering as an accessible domain for investigating them.
Taxonomy
Topics: Spam and Phishing Detection · Imbalanced Data Classification Techniques · Advanced Malware Detection Techniques
