When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift
Max Fomin

TL;DR
This paper critically evaluates prompt injection attack classifiers using a diverse dataset and introduces a Leave-One-Dataset-Out evaluation method to reveal overestimated performance and dataset-specific shortcuts, highlighting the need for more robust detection approaches.
Contribution
It introduces LODO evaluation for out-of-distribution generalization, analyzes dataset-dependent shortcuts in features, and systematically compares existing guardrails and LLM judges, exposing their limitations.
Findings
Standard train-test splits drawn from the same dataset sources overestimate aggregate performance by 8.4 percentage points of AUC.
28% of top features are dataset-dependent shortcuts.
All evaluated guardrails fail on indirect attacks.
Abstract
Detecting prompt injection and jailbreak attacks is critical for deploying LLM-based agents safely. As agents increasingly process untrusted data from emails, documents, tool outputs, and external APIs, robust attack detection becomes essential. Yet current evaluation practices and production systems have fundamental limitations. We present a comprehensive analysis using a diverse benchmark of 18 datasets spanning harmful requests, jailbreaks, indirect prompt injections, and extraction attacks. We propose Leave-One-Dataset-Out (LODO) evaluation to measure true out-of-distribution generalization, revealing that the standard practice of drawing train and test splits from the same dataset sources severely overestimates performance: aggregate metrics show an 8.4 percentage point AUC inflation, but per-dataset accuracy gaps range from 1% to 25%, exposing heterogeneous failure modes. To understand why…
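The Leave-One-Dataset-Out idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `lodo_splits` and the dataset labels below are hypothetical, and the sketch covers only the splitting step. Each fold holds out every example from one source dataset, so a classifier trained on that fold has never seen the held-out dataset.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def lodo_splits(
    dataset_ids: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_dataset, train_indices, test_indices), one fold per dataset.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Group example indices by the dataset they came from.
    by_dataset: Dict[str, List[int]] = defaultdict(list)
    for i, name in enumerate(dataset_ids):
        by_dataset[name].append(i)

    # One fold per dataset: test on it, train on everything else.
    for held_out in sorted(by_dataset):
        test_idx = by_dataset[held_out]
        train_idx = [i for i, name in enumerate(dataset_ids) if name != held_out]
        yield held_out, train_idx, test_idx


# Hypothetical source labels for six prompts drawn from three datasets.
sources = ["jailbreaks", "jailbreaks", "indirect", "indirect", "extraction", "extraction"]
folds = list(lodo_splits(sources))
for held_out, train_idx, test_idx in folds:
    print(f"held out: {held_out:<10} train={train_idx} test={test_idx}")
```

A standard random split would instead mix examples from every dataset into both train and test, which is what lets dataset-specific shortcuts inflate the score. With scikit-learn available, `sklearn.model_selection.LeaveOneGroupOut` produces the same kind of folds when the dataset labels are passed as `groups`.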
Taxonomy
Topics: Spam and Phishing Detection · Advanced Malware Detection Techniques · Network Security and Intrusion Detection
