Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems
Lijia Liu, Takumi Kondo, Kyohei Atarashi, Koh Takeuchi, Jiyi Li, Shigeru Saito, Hisashi Kashima

TL;DR
This paper introduces a counterfactual evaluation framework to detect blind prompt injection attacks in LLM-based evaluation systems, significantly enhancing security with minimal performance impact.
Contribution
It formalizes blind attacks and proposes a novel SE+CFE framework that re-evaluates answers against false ground-truths to detect deception.
Findings
Standard evaluation is highly vulnerable to blind attacks.
SE+CFE framework significantly improves attack detection.
Minimal performance trade-offs observed with the new framework.
Abstract
This paper investigates defenses for LLM-based evaluation systems against prompt injection. We formalize a class of threats called blind attacks, where a candidate answer is crafted independently of the true answer to deceive the evaluator. To counter such attacks, we propose a framework that augments Standard Evaluation (SE) with Counterfactual Evaluation (CFE), which re-evaluates the submission against a deliberately false ground-truth answer. An attack is detected if the system validates an answer under both standard and counterfactual conditions. Experiments show that while standard evaluation is highly vulnerable, our SE+CFE framework significantly improves security by boosting attack detection with minimal performance trade-offs.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNetwork Security and Intrusion Detection · Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques
