MeAJOR Corpus: A Multi-Source Dataset for Phishing Email Detection
Paulo Mendes, Eva Maia, Isabel Praça

TL;DR
The MeAJOR Corpus is a comprehensive, multi-source dataset of over 135,000 phishing and legitimate emails designed to improve machine learning detection models by providing diverse, high-quality training data that enhances generalizability and reproducibility.
Contribution
This paper introduces the MeAJOR Corpus, a large, multi-source phishing email dataset with engineered features, addressing limitations of existing datasets and supporting advanced detection research.
Findings
Achieved 98.34% F1 score with XGB classifier.
Demonstrated dataset's effectiveness across multiple models.
Addressed class imbalance and reproducibility challenges.
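For reference, the F1 score cited in the findings is the harmonic mean of precision and recall. The sketch below computes it for a single positive class, assuming binary labels where 1 means phishing and 0 means legitimate. The function name `f1_score_binary` and the toy labels are hypothetical illustrations, not data or code from the paper, and the summary does not state which averaging scheme the authors used.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Return F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        # No correct positive predictions: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
score = f1_score_binary(y_true, y_pred)
print(f"F1 = {score:.4f}")
```

With one missed phishing email and one false alarm among four phishing samples, precision and recall are both 0.75, so the F1 score is also 0.75.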
Abstract
Phishing emails continue to pose a significant threat to cybersecurity by exploiting human vulnerabilities through deceptive content and malicious payloads. While Machine Learning (ML) models are effective at detecting phishing threats, their performance largely relies on the quality and diversity of the training data. This paper presents the MeAJOR (Merged email Assets from Joint Open-source Repositories) Corpus, a novel, multi-source phishing email dataset designed to overcome critical limitations in existing resources. It integrates 135,894 samples representing a broad range of phishing tactics and legitimate emails, with a wide spectrum of engineered features. We evaluated the dataset's utility for phishing detection research through systematic experiments with four classification models (RF, XGB, MLP, and CNN) across multiple feature configurations. Results highlight the dataset's…
Taxonomy
Topics: Spam and Phishing Detection · Misinformation and Its Impacts · Sentiment Analysis and Opinion Mining
