TL;DR
This paper introduces seven novel cross-domain techniques for detecting prompt injection attacks in language models, addressing limitations of existing methods like pattern matching and classifiers.
Contribution
It proposes seven diverse detection mechanisms from various disciplines and implements three in an open-source tool, significantly improving detection performance.
Findings
Local-alignment detector increases F1 from 0.033 to 0.378 on deepset dataset.
Stylometric detector adds 11.1 percentage points of F1 on an indirect-injection benchmark.
Fatigue tracker validated through a probing-campaign integration test.
Abstract
Current open-source prompt-injection detectors converge on two architectural choices: regular-expression pattern matching and fine-tuned transformer classifiers. Both share failure modes that recent work has made concrete. Regular expressions miss paraphrased attacks. Fine-tuned classifiers are vulnerable to adaptive adversaries: a 2025 NAACL Findings study reported that eight published indirect-injection defenses were bypassed with greater than fifty percent attack success rates under adaptive attacks. This work proposes seven detection techniques that each port a specific mechanism from a discipline outside large-language-model security: forensic linguistics, materials-science fatigue analysis, deception technology from network security, local-sequence alignment from bioinformatics, mechanism design from economics, spectral signal analysis from epidemiology, and taint tracking from…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
