A pipeline and comparative study of 12 machine learning models for text classification
Annalisa Occhipinti, Louis Rogers, Claudio Angione

TL;DR
This paper compares 12 machine learning models for email spam detection, proposing a new pipeline for hyperparameter optimization and feature analysis to enhance classification accuracy and interpretability.
Contribution
The paper introduces a novel pipeline for hyperparameter optimization and feature selection in text classifiers, validated on the Enron spam dataset.
Findings
Achieved up to 94% F-score in spam classification
Demonstrated the effectiveness of the pipeline in improving model performance
Provided insights into words influencing classification outcomes
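The three ideas above (tuning a hyperparameter, scoring with the F-score, and inspecting which words drive a decision) can be illustrated with a minimal sketch. This is not the authors' pipeline. The classifier (a hand-written multinomial Naive Bayes), the smoothing grid, the toy messages, and all function names are assumptions chosen only to keep the example self-contained, and the Enron corpus is not used here.

```python
import math
from collections import Counter

def tokenize(text):
    # Simplistic whitespace tokenizer; real preprocessing would be richer.
    return text.lower().split()

def train_nb(docs, labels, alpha):
    # Multinomial Naive Bayes with additive smoothing `alpha` (must be > 0).
    counts = {0: Counter(), 1: Counter()}
    for doc, y in zip(docs, labels):
        counts[y].update(tokenize(doc))
    vocab = set(counts[0]) | set(counts[1])
    n_docs = Counter(labels)
    model = {"vocab": vocab, "prior": {}, "loglik": {}}
    for y in (0, 1):
        total = sum(counts[y].values()) + alpha * len(vocab)
        model["prior"][y] = math.log(n_docs[y] / len(labels))
        model["loglik"][y] = {
            w: math.log((counts[y][w] + alpha) / total) for w in vocab
        }
    return model

def predict(model, doc):
    # Returns 1 (spam) or 0 (ham); words outside the vocabulary are ignored.
    scores = {}
    for y in (0, 1):
        scores[y] = model["prior"][y] + sum(
            model["loglik"][y][w] for w in tokenize(doc) if w in model["vocab"]
        )
    return 1 if scores[1] > scores[0] else 0

def f_score(y_true, y_pred):
    # F1: harmonic mean of precision and recall for the spam class (label 1).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def grid_search(train, val, alphas):
    # Try each smoothing value; keep the first one with the best F-score.
    best = None
    for alpha in alphas:
        model = train_nb(train[0], train[1], alpha)
        score = f_score(val[1], [predict(model, d) for d in val[0]])
        if best is None or score > best[1]:
            best = (alpha, score, model)
    return best

def top_spam_words(model, k=3):
    # Rank words by log-likelihood ratio: how much more typical of spam.
    ratio = {w: model["loglik"][1][w] - model["loglik"][0][w]
             for w in model["vocab"]}
    return sorted(ratio, key=ratio.get, reverse=True)[:k]

# Hypothetical toy data (1 = spam, 0 = ham), for illustration only.
train = (
    ["win free money now", "free prize claim now", "meeting agenda attached",
     "please review the report", "claim your free money",
     "lunch meeting tomorrow"],
    [1, 1, 0, 0, 1, 0],
)
val = (["free money prize", "review meeting agenda"], [1, 0])

best_alpha, best_score, best_model = grid_search(train, val, [0.1, 1.0, 10.0])
print(best_alpha, best_score, top_spam_words(best_model, 1))
```

On a real corpus, a single hold-out split like this would normally be replaced by cross-validation, and the grid would cover more than one hyperparameter. The log-likelihood ratio used in `top_spam_words` is one simple way to surface influential words; the paper may use a different feature-analysis method.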
Abstract
Text-based communication is highly favoured as a communication method, especially in business environments. As a result, it is often abused by sending malicious messages, e.g., spam emails, to deceive users into relaying personal information, including online account credentials or banking details. For this reason, many machine learning methods for text classification have been proposed and incorporated into the services of most email providers. However, optimising text classification algorithms and finding the right tradeoff on their aggressiveness is still a major research problem. We present an updated survey of 12 machine learning text classifiers applied to a public spam corpus. A new pipeline is proposed to optimise hyperparameter selection and improve the models' performance by applying specific methods (based on natural language processing) in the preprocessing stage. Our…
Taxonomy
Topics: Spam and Phishing Detection · Text and Document Classification Technologies · Sentiment Analysis and Opinion Mining
