Investigating the Vulnerability of LLM-as-a-Judge Architectures to Prompt-Injection Attacks
Narek Maloyan, Bislan Ashinov, Dmitry Namiot

TL;DR
This paper reveals that LLM-based evaluation systems are vulnerable to prompt-injection attacks, which can significantly manipulate their judgments, highlighting critical security concerns in deploying LLMs as evaluators.
Contribution
It formalizes two attack strategies, the Comparative Undermining Attack (CUA) and the Justification Manipulation Attack (JMA), and demonstrates their effectiveness against open-source LLM-as-a-Judge architectures.
Findings
CUA achieves an attack success rate above 30%
JMA effectively alters the judge's generated reasoning
Vulnerabilities highlight need for robust defenses
Abstract
Large Language Models (LLMs) are increasingly employed as evaluators (LLM-as-a-Judge) for assessing the quality of machine-generated text. This paradigm offers scalability and cost-effectiveness compared to human annotation. However, the reliability and security of such systems, particularly their robustness against adversarial manipulations, remain critical concerns. This paper investigates the vulnerability of LLM-as-a-Judge architectures to prompt-injection attacks, where malicious inputs are designed to compromise the judge's decision-making process. We formalize two primary attack strategies: Comparative Undermining Attack (CUA), which directly targets the final decision output, and Justification Manipulation Attack (JMA), which aims to alter the model's generated reasoning. Using the Greedy Coordinate Gradient (GCG) optimization method, we craft adversarial suffixes appended to…
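To make the idea of an attack on the judge's final decision concrete, the following is a minimal sketch of how an attack success rate could be computed for a pairwise judge. It is an illustration under stated assumptions, not the paper's evaluation code: the `Trial` record, its field names, and the choice to count only trials where the judge did not already prefer the attacker's target are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    """One pairwise comparison (hypothetical record, not from the paper)."""
    clean_verdict: str     # response the judge prefers without a suffix, e.g. "A" or "B"
    attacked_verdict: str  # response the judge prefers once the adversarial suffix is appended
    target: str            # response the attacker wants the judge to prefer


def attack_success_rate(trials: List[Trial]) -> float:
    """Fraction of eligible trials in which the verdict flips to the target.

    Assumption: a trial is eligible only if the judge did not already
    prefer the target before the attack, so pre-existing agreement is
    not counted as a success.
    """
    eligible = [t for t in trials if t.clean_verdict != t.target]
    if not eligible:
        return 0.0
    flipped = sum(1 for t in eligible if t.attacked_verdict == t.target)
    return flipped / len(eligible)
```

Under this definition, a reported rate above 30% would mean that more than three in ten comparisons the attacker was originally losing end with the judge choosing the attacker's response.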
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection
