When Scanners Lie: Evaluator Instability in LLM Red-Teaming
Lidor Erez, Omer Hofman, Tamir Nizri, Roman Vainshtein

TL;DR
This paper shows that the choice of evaluator significantly affects the reliability of LLM vulnerability assessments, and it introduces a framework to improve evaluation consistency and accuracy.
Contribution
It presents a two-phase, reliability-aware evaluation framework that quantifies evaluator disagreement and employs verification to enhance assessment reliability.
Findings
22 of 25 attack categories show evaluator instability
Evaluator accuracy improved from 72% to 89%
Vulnerability scores can vary by up to 33% depending on the evaluator
Abstract
Automated LLM vulnerability scanners are increasingly used to assess security risks by measuring attack success rates (ASR) across attack types. Yet the validity of these measurements hinges on an often-overlooked component: the evaluator that determines whether an attack has succeeded. In this study, we demonstrate that commonly used open-source scanners exhibit measurement instability that depends on the evaluator component. Consequently, changing the evaluator while keeping the attacks and model outputs constant can significantly alter the reported ASR. To tackle this problem, we present a two-phase, reliability-aware evaluation framework. In the first phase, we quantify evaluator disagreement to identify attack categories where ASR reliability cannot be assumed. In the second phase, we propose a verification-based evaluation method where evaluators are validated by an independent…
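The first phase described in the abstract, quantifying evaluator disagreement over a fixed set of attacks and model outputs, can be illustrated with a minimal sketch. The abstract does not state which disagreement metric the authors use, so the two measures below (the spread between the highest and lowest reported ASR, and the mean pairwise disagreement rate) are assumptions chosen for illustration. The evaluator names and verdicts are hypothetical, not data from the paper.

```python
from itertools import combinations


def asr(verdicts):
    """Attack success rate: fraction of attempts judged successful."""
    return sum(verdicts) / len(verdicts)


def asr_spread(verdicts_by_evaluator):
    """Return per-evaluator ASR and the gap between the highest and lowest."""
    rates = {name: asr(v) for name, v in verdicts_by_evaluator.items()}
    return rates, max(rates.values()) - min(rates.values())


def pairwise_disagreement(verdicts_by_evaluator):
    """Fraction of (evaluator pair, attempt) combinations with differing verdicts."""
    pairs = list(combinations(verdicts_by_evaluator.values(), 2))
    total = sum(len(a) for a, _ in pairs)
    differing = sum(x != y for a, b in pairs for x, y in zip(a, b))
    return differing / total


# Hypothetical verdicts: three evaluators judging the SAME eight model outputs
# for one attack category (True = attack judged successful).
verdicts = {
    "evaluator_a": [True, True, False, True, False, False, True, False],
    "evaluator_b": [True, False, False, True, False, False, False, False],
    "evaluator_c": [True, True, True, True, False, True, True, False],
}

rates, spread = asr_spread(verdicts)
disagreement = pairwise_disagreement(verdicts)

for name, rate in rates.items():
    print(f"{name}: ASR = {rate:.2f}")
print(f"ASR spread across evaluators: {spread:.2f}")
print(f"Mean pairwise disagreement: {disagreement:.2f}")
```

In this toy example the attacks and outputs never change, yet the reported ASR ranges from 0.25 to 0.75 depending on which evaluator is used. A category with a large spread or disagreement rate would be one where, in the abstract's terms, ASR reliability cannot be assumed.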
Taxonomy
Topics: Information and Cyber Security · Web Application Security Vulnerabilities · Software Engineering Research
