Massive Enhanced Extracted Email Features Tailored for Cosine Distance
Farshad Barahimi

TL;DR
This paper presents a method to convert Enron emails into a large, feature-rich dataset optimized for Cosine distance, enabling effective email classification with high accuracy using KNN.
Contribution
The paper introduces MeeefTCD, a novel feature extraction process tailored for Cosine distance, with extensive features and explainability for email classification.
Findings
KNN classification accuracy of 76.75% using Cosine distance
48557 features per email with only 40 non-zero features
Dataset available publicly for further research
Abstract
This paper explains and evaluates the process of converting the Enron email dataset (the version cited in the preprint) into thousands of features per email for a selected set of 2400 labelled emails. The final features are tailored for Cosine distance so that the Cosine distance inversely reflects the number of top indicative words of each email that the two emails share, in an explainable, normalized fashion. The labelling is based on the leaf folder name in the folder tree of the Enron email dataset (the version cited in the preprint), and the 2400 selected emails consist of 300 emails for each of the 8 labels. The evaluation is based on the accuracy of a k-nearest-neighbours majority voting classification using Cosine distance. In addition to the KNN majority voting classification accuracy and confusion matrix, some statistics for the process are reported. The KNN majority voting…
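The evaluation idea described in the abstract (KNN majority voting over Cosine distance on sparse features) can be illustrated with a minimal sketch. This is not the paper's implementation: the binary word-indicator weights, the toy emails, the labels "calendar" and "trading", and the function names `cosine_distance` and `knn_predict` are all assumptions made for illustration. With equal-weight binary features, the Cosine similarity of two emails grows with the number of features they share, so the distance falls as the overlap rises.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Cosine distance between two sparse vectors stored as {feature: weight} dicts."""
    # Iterate over the smaller dict so the dot product touches fewer keys.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(w * b[f] for f, w in a.items() if f in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: an empty vector is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def knn_predict(train, query, k=3):
    """Majority vote over the k training items closest to `query`."""
    nearest = sorted(train, key=lambda item: cosine_distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    # On a tied vote, the label that appears first among the nearest items wins.
    return votes.most_common(1)[0][0]


# Hypothetical toy data: binary indicators for a few words per email.
train = [
    ({"meeting": 1, "agenda": 1, "schedule": 1}, "calendar"),
    ({"meeting": 1, "room": 1, "schedule": 1}, "calendar"),
    ({"gas": 1, "pipeline": 1, "price": 1}, "trading"),
    ({"gas": 1, "price": 1, "contract": 1}, "trading"),
]
query = {"meeting": 1, "schedule": 1, "lunch": 1}

predicted = knn_predict(train, query, k=3)
print(predicted)
```

Storing only the non-zero entries in a dict keeps each comparison cheap when, as in the findings above, an email has tens of non-zero features out of tens of thousands.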
Taxonomy
Topics: Spam and Phishing Detection · User Authentication and Security Systems · Web Data Mining and Analysis
