Spam filtering by quantitative profiles
M. Grendár, J. Škutová, V. Špitalský

TL;DR
This paper introduces the quantitative profile approach to spam filtering, which represents each email as a numerical vector of fixed dimension m. At low computational cost, the approach performs at least comparably to heuristic rules and naive Bayes.
Contribution
It proposes a novel quantitative profile approach using line and character profiles, offering an alternative to bag-of-words for spam filtering.
Findings
Line and character profiles perform at least comparably to heuristic rules and naive Bayes.
The method has low computational cost.
Performance was evaluated on the TREC 2007 and CEAS 2008 corpora and on a private corpus.
Abstract
Instead of the 'bag-of-words' representation, in the quantitative profile approach to spam filtering and email categorization, an email is represented by an m-dimensional vector of numbers, with m fixed in advance. Inspired by Sroufe et al. [Sroufe, P., Phithakkitnukoon, S., Dantu, R., and Cangussu, J. (2010). Email shape analysis. In LNCS, 5935, pp. 18-29], two instances of quantitative profiles are considered: the line profile and the character profile. Performance of these profiles is studied on the TREC 2007 and CEAS 2008 corpora and on a private corpus. At low computational costs, the two quantitative profiles achieve performance that is at least comparable to that of heuristic rules and naive Bayes.
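To make the idea of a fixed-dimensional representation concrete, here is a minimal Python sketch. The abstract does not define the line profile or the character profile, so both definitions below are assumptions made purely for illustration: `line_profile` is taken to be a normalized histogram of line lengths over m bins, and `char_profile` a vector of character-class fractions. The function names and the parameters `m` and `max_len` are hypothetical and are not taken from the paper.

```python
def line_profile(text: str, m: int = 8, max_len: int = 80) -> list[float]:
    """Assumed definition: normalized histogram of line lengths in m bins.

    Lines of max_len characters or more fall into the last bin.
    """
    lines = text.splitlines() or [""]  # treat an empty email as one empty line
    width = max_len / m                # width of each length bin
    counts = [0] * m
    for line in lines:
        idx = min(int(len(line) / width), m - 1)
        counts[idx] += 1
    return [c / len(lines) for c in counts]


def char_profile(text: str) -> list[float]:
    """Assumed definition: fractions of five character classes.

    Order: uppercase, lowercase, digit, whitespace, other.
    """
    counts = [0] * 5
    for ch in text:
        if ch.isupper():
            counts[0] += 1
        elif ch.islower():
            counts[1] += 1
        elif ch.isdigit():
            counts[2] += 1
        elif ch.isspace():
            counts[3] += 1
        else:
            counts[4] += 1
    total = len(text)
    return [c / total for c in counts] if total else [0.0] * 5


if __name__ == "__main__":
    email = "Dear friend,\nWIN $1000 NOW!!!\nClick here"
    print(line_profile(email, m=4))
    print(char_profile(email))
```

Whatever the length of the email, each function returns a vector whose dimension is fixed in advance, so the output can be passed directly to a standard classifier. This is the property that distinguishes such profiles from a bag-of-words representation, whose dimension depends on the vocabulary.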
Taxonomy
Topics: Spam and Phishing Detection · Authorship Attribution and Profiling · Text and Document Classification Technologies
