MeAJOR Corpus: A Multi-Source Dataset for Phishing Email Detection

Paulo Mendes; Eva Maia; Isabel Praça

arXiv:2507.17978 · cs.CR · November 7, 2025

TL;DR

The MeAJOR Corpus is a comprehensive, multi-source dataset of over 135,000 phishing and legitimate emails designed to improve machine learning detection models by providing diverse, high-quality training data that enhances generalizability and reproducibility.

Contribution

This paper introduces the MeAJOR Corpus, a large, multi-source phishing email dataset with engineered features, addressing limitations of existing datasets and supporting advanced detection research.

Findings

01. Achieved a 98.34% F1 score with the XGB classifier.

02. Demonstrated the dataset's effectiveness across multiple models.

03. Addressed class imbalance and reproducibility challenges.
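For readers unfamiliar with the metric behind the headline result, the F1 score is the harmonic mean of precision and recall on the positive (phishing) class. The following is a minimal sketch in plain Python, not the authors' evaluation code; the function name `f1_score`, the label convention (1 = phishing, 0 = legitimate), and the toy labels are all assumptions made for illustration.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class.

    Illustrative helper only; the label convention (1 = phishing,
    0 = legitimate) is an assumption, not taken from the paper.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined),
        # so report 0.0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for the example: 3 true positives,
# 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(f"F1 = {f1_score(y_true, y_pred):.4f}")
```

Because F1 ignores true negatives, it is less flattering than plain accuracy when legitimate emails heavily outnumber phishing ones, which is one reason it is commonly reported for imbalanced detection tasks.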

Abstract

Phishing emails continue to pose a significant threat to cybersecurity by exploiting human vulnerabilities through deceptive content and malicious payloads. While Machine Learning (ML) models are effective at detecting phishing threats, their performance largely relies on the quality and diversity of the training data. This paper presents MeAJOR (Merged email Assets from Joint Open-source Repositories) Corpus, a novel, multi-source phishing email dataset designed to overcome critical limitations in existing resources. It integrates 135,894 samples representing a broad range of phishing tactics and legitimate emails, with a wide spectrum of engineered features. We evaluated the dataset's utility for phishing detection research through systematic experiments with four classification models (RF, XGB, MLP, and CNN) across multiple feature configurations. Results highlight the dataset's…

Taxonomy

Topics: Spam and Phishing Detection · Misinformation and Its Impacts · Sentiment Analysis and Opinion Mining