Investigating the Vulnerability of LLM-as-a-Judge Architectures to Prompt-Injection Attacks

Narek Maloyan; Bislan Ashinov; Dmitry Namiot

arXiv:2505.13348·cs.CL·May 20, 2025

TL;DR

This paper shows that LLM-based evaluation systems are vulnerable to prompt-injection attacks that can substantially manipulate their judgments, raising security concerns about deploying LLMs as evaluators.

Contribution

The paper formalizes two attack strategies, the Comparative Undermining Attack (CUA) and the Justification Manipulation Attack (JMA), and demonstrates their effectiveness against open-source LLM-as-a-Judge architectures.

Findings

01. CUA attack success rate exceeds 30%

02. JMA effectively alters model reasoning

03. Vulnerabilities highlight need for robust defenses
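
To make the "attack success rate" figure concrete, here is a minimal sketch of how such a rate could be computed for a pairwise judge. It assumes the rate is the fraction of initially unfavourable comparisons in which the judge's verdict flips to the attacker's target; this page does not give the paper's exact definition, and the `JudgedPair` record and `attack_success_rate` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class JudgedPair:
    """One pairwise comparison judged with and without the attack (hypothetical record)."""
    clean_verdict: str     # judge's choice without the attack, e.g. "A" or "B"
    attacked_verdict: str  # judge's choice with the adversarial input present
    target: str            # the verdict the attacker wants


def attack_success_rate(pairs: list[JudgedPair]) -> float:
    """Fraction of initially unfavourable cases whose verdict flips to the target.

    Assumed definition: cases where the clean verdict already equals the
    target are excluded, since the attack has nothing to change there.
    """
    eligible = [p for p in pairs if p.clean_verdict != p.target]
    if not eligible:
        return 0.0
    flipped = sum(p.attacked_verdict == p.target for p in eligible)
    return flipped / len(eligible)


# Toy data, not results from the paper.
pairs = [
    JudgedPair("A", "B", "B"),  # attack flipped the verdict
    JudgedPair("A", "A", "B"),  # attack failed
    JudgedPair("A", "A", "B"),  # attack failed
    JudgedPair("B", "B", "B"),  # already favourable, excluded
]
asr = attack_success_rate(pairs)
print(f"attack success rate: {asr:.1%}")
```

Excluding already-favourable cases is a design choice of this sketch; counting them would inflate the rate with verdicts the attack did not cause.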

Abstract

Large Language Models (LLMs) are increasingly employed as evaluators (LLM-as-a-Judge) for assessing the quality of machine-generated text. This paradigm offers scalability and cost-effectiveness compared to human annotation. However, the reliability and security of such systems, particularly their robustness against adversarial manipulations, remain critical concerns. This paper investigates the vulnerability of LLM-as-a-Judge architectures to prompt-injection attacks, where malicious inputs are designed to compromise the judge's decision-making process. We formalize two primary attack strategies: Comparative Undermining Attack (CUA), which directly targets the final decision output, and Justification Manipulation Attack (JMA), which aims to alter the model's generated reasoning. Using the Greedy Coordinate Gradient (GCG) optimization method, we craft adversarial suffixes appended to…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection