Real-time and Zero-footprint Bag of Synthetic Syllables Algorithm for E-mail Spam Detection Using Subject Line and Short Text Fields
Stanislav Selitskiy

TL;DR
This paper introduces a fast, resource-efficient algorithm that detects spam in email subject lines and short texts by creating sparse vectors, enabling real-time filtering without additional hardware or storage.
Contribution
The paper proposes a novel zero-footprint Bag of Synthetic Syllables algorithm for real-time spam detection using simple vector comparisons on short email texts.
Findings
Effective on one day of real SMTP traffic
No need for persistent storage or hardware upgrades
Suitable for high-volume, resource-constrained environments
Abstract
Contemporary e-mail services face high availability expectations from customers and are resource-strained by high-volume throughput and spam attacks. Deep machine learning architectures, which are resource-hungry and require off-line processing because of their long processing times, are not acceptable as front-line filters. On the other hand, the bulk of incoming spam is not sophisticated enough to bypass even the simplest algorithms. While the small fraction of intelligent, highly mutable spam can be detected only by deep architectures, the load on them can be reduced by simple near-real-time, near-zero-footprint algorithms such as the Bag of Synthetic Syllables algorithm, applied to the short texts of e-mail subject lines and other short text fields. The proposed algorithm creates a sparse hash, or vector, of circa 200 dimensions for each e-mail…
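The truncated abstract does not say how a synthetic syllable is built or how vectors are compared, so the following is only a minimal sketch of the general idea under stated assumptions: a "syllable" is approximated by a two-character window over the lowercased subject line, each window is hashed into one of 200 buckets, and two subjects are compared by cosine similarity. The names `syllable_vector`, `cosine_similarity`, and `DIMENSIONS` are hypothetical and are not taken from the paper.

```python
import math
import zlib
from collections import Counter

# "circa 200" comes from the abstract; the exact value is an assumption.
DIMENSIONS = 200


def syllable_vector(text, dims=DIMENSIONS):
    """Map a short text to a sparse {bucket index: count} vector.

    Assumption: a "synthetic syllable" is approximated by a two-character
    window over the lowercased letters and digits of the text.
    """
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum())
    vector = Counter()
    for i in range(len(cleaned) - 1):
        chunk = cleaned[i:i + 2]
        # crc32 is stable across runs, unlike the built-in hash() on str.
        vector[zlib.crc32(chunk.encode("utf-8")) % dims] += 1
    return vector


def cosine_similarity(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[idx] for idx, count in a.items() if idx in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Under these assumptions, a front-line filter could keep the vectors of recently seen subject lines in memory only and flag a new message whose similarity to many of them exceeds a chosen threshold, which would be in keeping with the no-persistent-storage claim above.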
Taxonomy
Topics: Spam and Phishing Detection · Text and Document Classification Technologies · Big Data and Digital Economy
