The Phish, The Spam, and The Valid: Generating Feature-Rich Emails for Benchmarking LLMs
Rebeka Toth, Tamas Bisztray, Nils Gruschka

TL;DR
This paper presents PhishFuzzer, a framework for generating a large, diverse dataset of metadata-rich emails with labels for phishing, spam, and valid categories, to benchmark and improve LLM-based email security systems.
Contribution
Introduction of PhishFuzzer, a novel framework that creates a large, structured, and labeled email dataset for benchmarking LLMs in email security tasks.
Findings
LLMs show varying reliability in detecting email types.
Structural metadata (URL, sender, and attachment fields) improves detection accuracy.
Model robustness against linguistic fuzzing varies.
Abstract
In this paper, we introduce a metadata-enriched generation framework (PhishFuzzer) that seeds real emails into Large Language Models (LLMs) to produce 23,100 diverse, structurally consistent email variants across controlled entity and length dimensions. Unlike prior corpora, our dataset features strict three-class labels (Phishing, Spam, Valid), provides full URL and attachment metadata, and annotates each email with attacker intent. Using this dataset, we benchmark two state-of-the-art LLMs (Qwen-2.5-72B and Gemini-3.1-Pro) under both Basic (body, subject) and Full (+URL, sender, attachment) settings. By applying formal confidence metrics (Task Success Rate and Confidence Index), we analyze model reliability, robustness against linguistic fuzzing, and the impact of structural metadata on detection accuracy. Our fully open-source framework and dataset provide a rigorous foundation for…
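The difference between the Basic and Full settings described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Email` record, its field names, the `build_prompt` helper, and the prompt wording are all hypothetical and are assumed here only to show which fields each setting would expose to the model.

```python
from dataclasses import dataclass, field
from typing import List

# The three class labels named in the abstract.
LABELS = ("Phishing", "Spam", "Valid")


@dataclass
class Email:
    """Hypothetical email record; field names are assumptions, not the dataset schema."""
    subject: str
    body: str
    sender: str = ""
    urls: List[str] = field(default_factory=list)
    attachments: List[str] = field(default_factory=list)


def build_prompt(email: Email, setting: str = "basic") -> str:
    """Build a classification prompt for the given setting.

    "basic" exposes only subject and body; "full" adds sender, URL,
    and attachment metadata, mirroring the two settings in the abstract.
    """
    if setting not in ("basic", "full"):
        raise ValueError(f"unknown setting: {setting!r}")

    lines = [
        "Classify the email as one of: " + ", ".join(LABELS) + ".",
        f"Subject: {email.subject}",
        f"Body: {email.body}",
    ]
    if setting == "full":
        # Structural metadata is appended only in the Full setting.
        lines.append(f"Sender: {email.sender}")
        lines.append("URLs: " + (", ".join(email.urls) or "none"))
        lines.append("Attachments: " + (", ".join(email.attachments) or "none"))
    return "\n".join(lines)
```

Under these assumptions, comparing a model's answers on `build_prompt(email, "basic")` and `build_prompt(email, "full")` for the same email isolates the contribution of the structural metadata.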
Taxonomy
Topics: Spam and Phishing Detection · Cybercrime and Law Enforcement Studies · Personal Information Management and User Behavior
