Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems

Lijia Liu; Takumi Kondo; Kyohei Atarashi; Koh Takeuchi; Jiyi Li; Shigeru Saito; Hisashi Kashima

arXiv:2507.23453·cs.CR·December 16, 2025

Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems

Lijia Liu, Takumi Kondo, Kyohei Atarashi, Koh Takeuchi, Jiyi Li, Shigeru Saito, Hisashi Kashima

PDF

Open Access

TL;DR

This paper introduces a counterfactual evaluation framework to detect blind prompt injection attacks in LLM-based evaluation systems, significantly enhancing security with minimal performance impact.

Contribution

It formalizes blind attacks and proposes a novel SE+CFE framework that re-evaluates answers against false ground-truths to detect deception.

Findings

01

Standard evaluation is highly vulnerable to blind attacks.

02

SE+CFE framework significantly improves attack detection.

03

Minimal performance trade-offs observed with the new framework.

Abstract

This paper investigates defenses for LLM-based evaluation systems against prompt injection. We formalize a class of threats called blind attacks, where a candidate answer is crafted independently of the true answer to deceive the evaluator. To counter such attacks, we propose a framework that augments Standard Evaluation (SE) with Counterfactual Evaluation (CFE), which re-evaluates the submission against a deliberately false ground-truth answer. An attack is detected if the system validates an answer under both standard and counterfactual conditions. Experiments show that while standard evaluation is highly vulnerable, our SE+CFE framework significantly improves security by boosting attack detection with minimal performance trade-offs.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsNetwork Security and Intrusion Detection · Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques