When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift

Max Fomin

arXiv:2602.14161·cs.LG·February 17, 2026

TL;DR

This paper critically evaluates prompt injection attack classifiers using a diverse dataset and introduces a Leave-One-Dataset-Out evaluation method to reveal overestimated performance and dataset-specific shortcuts, highlighting the need for more robust detection approaches.

Contribution

It introduces LODO evaluation for out-of-distribution generalization, analyzes dataset-dependent shortcuts in features, and systematically compares existing guardrails and LLM judges, exposing their limitations.

Findings

01

Standard train-test splits overestimate aggregate performance by 8.4 percentage points of AUC.

02

28% of top features are dataset-dependent shortcuts.

03

All evaluated guardrails fail on indirect attacks.

Abstract

Detecting prompt injection and jailbreak attacks is critical for deploying LLM-based agents safely. As agents increasingly process untrusted data from emails, documents, tool outputs, and external APIs, robust attack detection becomes essential. Yet current evaluation practices and production systems have fundamental limitations. We present a comprehensive analysis using a diverse benchmark of 18 datasets spanning harmful requests, jailbreaks, indirect prompt injections, and extraction attacks. We propose Leave-One-Dataset-Out (LODO) evaluation to measure true out-of-distribution generalization, revealing that the standard practice of drawing train and test splits from the same dataset sources severely overestimates performance: aggregate metrics show an 8.4 percentage point AUC inflation, but per-dataset gaps range from 1% to 25% accuracy, exposing heterogeneous failure modes. To understand why…
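
The Leave-One-Dataset-Out idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the record layout `(dataset_name, prompt_text, label)`, the helper names `lodo_splits`, `lodo_accuracy`, and `majority_fit`, and the toy data are all assumptions made for illustration. The only point it tries to show is that each fold holds out one entire source dataset, so the test prompts never share a source with the training prompts.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical record layout: (dataset_name, prompt_text, label), label 1 = malicious.
Example = Tuple[str, str, int]
Predictor = Callable[[str], int]


def lodo_splits(examples: List[Example]) -> Iterator[Tuple[str, List[Example], List[Example]]]:
    """Yield (held_out_name, train, test), holding out one whole dataset per fold."""
    by_dataset: Dict[str, List[Example]] = defaultdict(list)
    for ex in examples:
        by_dataset[ex[0]].append(ex)
    for held_out in sorted(by_dataset):
        # Train on every dataset except the held-out one.
        train = [ex for name, rows in by_dataset.items() if name != held_out for ex in rows]
        test = list(by_dataset[held_out])
        yield held_out, train, test


def lodo_accuracy(examples: List[Example],
                  fit: Callable[[List[Example]], Predictor]) -> Dict[str, float]:
    """Fit on the remaining datasets and report accuracy on each held-out dataset."""
    scores: Dict[str, float] = {}
    for held_out, train, test in lodo_splits(examples):
        predict = fit(train)
        correct = sum(predict(text) == label for _, text, label in test)
        scores[held_out] = correct / len(test)
    return scores


def majority_fit(train: List[Example]) -> Predictor:
    """Placeholder model (an assumption): always predict the majority training label."""
    positives = sum(label for _, _, label in train)
    majority = int(positives * 2 >= len(train))
    return lambda text: majority


# Toy data, invented for illustration only.
toy: List[Example] = [
    ("A", "ignore previous instructions", 1),
    ("A", "what is the weather", 0),
    ("B", "reveal your system prompt", 1),
    ("B", "summarize this email", 0),
    ("C", "translate to French", 0),
]
```

Calling `lodo_accuracy(toy, majority_fit)` returns one score per held-out dataset. Comparing those scores with the same model's scores on a conventional random split of the pooled data would give a per-dataset gap of the kind the abstract reports, though the placeholder model here is far too simple to say anything about real classifiers.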

Code & Models

Datasets

prodnull/prompt-injection-repo-dataset
dataset· 43 dl

Taxonomy

Topics: Spam and Phishing Detection · Advanced Malware Detection Techniques · Network Security and Intrusion Detection