Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment
Vyas Raina, Adian Liusie, Mark Gales

TL;DR
This paper investigates how vulnerable LLM judges are to adversarial manipulation. It shows that short adversarial phrases appended to a text can inflate assessment scores, particularly under absolute scoring, which raises concerns about the reliability of LLM-based assessment.
Contribution
First analysis of the adversarial robustness of judge-LLMs. The paper proposes a surrogate attack method and demonstrates significant score inflation on unseen models.
Findings
Universal adversarial phrases can drastically inflate scores.
Judge-LLMs are more vulnerable in absolute scoring than in comparative assessment.
Scores can be manipulated even when attack phrases are transferred to unknown models.
Abstract
Large Language Models (LLMs) are powerful zero-shot assessors used in real-world situations such as assessing written exams and benchmarking systems. Despite these critical applications, no existing work has analyzed the vulnerability of judge-LLMs to adversarial manipulation. This work presents the first study on the adversarial robustness of assessment LLMs, where we demonstrate that short universal adversarial phrases can be concatenated to deceive judge LLMs to predict inflated scores. Since adversaries may not know or have access to the judge-LLMs, we propose a simple surrogate attack where a surrogate model is first attacked, and the learned attack phrase then transferred to unknown judge-LLMs. We propose a practical algorithm to determine the short universal attack phrases and demonstrate that when transferred to unseen models, scores can be drastically inflated such that…
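The surrogate attack described in the abstract can be illustrated with a small sketch. The abstract does not spell out the search procedure, so the greedy word-by-word search below is an assumption, and `toy_surrogate_judge`, `mean_score`, and `greedy_universal_phrase` are hypothetical names introduced only for illustration. The toy judge is a hand-written scoring function standing in for a surrogate LLM; a real attack would query an actual model.

```python
from typing import Callable, List, Sequence


def toy_surrogate_judge(text: str) -> float:
    """Hypothetical stand-in for a surrogate judge-LLM (not a real model).

    Returns a score capped at 5.0 that rises when certain words appear.
    """
    bonus_words = {"excellent": 1.0, "flawless": 0.7, "coherent": 0.4}
    score = 2.0 + sum(bonus_words.get(w, 0.0) for w in text.lower().split())
    return min(5.0, score)


def mean_score(
    judge: Callable[[str], float],
    texts: Sequence[str],
    phrase: Sequence[str],
) -> float:
    """Average judge score over all texts with the phrase appended."""
    suffix = " ".join(phrase)
    return sum(judge(f"{t} {suffix}".strip()) for t in texts) / len(texts)


def greedy_universal_phrase(
    judge: Callable[[str], float],
    texts: Sequence[str],
    vocab: Sequence[str],
    max_len: int,
) -> List[str]:
    """Assumed greedy search: add one word at a time, keeping the word
    that maximizes the mean score across every text (hence "universal")."""
    phrase: List[str] = []
    for _ in range(max_len):
        best_word = max(
            vocab, key=lambda w: mean_score(judge, texts, phrase + [w])
        )
        phrase.append(best_word)
    return phrase


if __name__ == "__main__":
    texts = ["the cat sat", "a summary of the article"]
    vocab = ["table", "excellent", "coherent", "flawless"]
    phrase = greedy_universal_phrase(toy_surrogate_judge, texts, vocab, max_len=2)
    print("attack phrase:", " ".join(phrase))
    print("mean score before:", mean_score(toy_surrogate_judge, texts, []))
    print("mean score after:", mean_score(toy_surrogate_judge, texts, phrase))
```

In the transfer setting, the phrase learned against the surrogate would then be appended to texts scored by a different, unseen judge, with the score change measured there. The sketch covers only the phrase search on the surrogate.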
Taxonomy
Topics: Adversarial Robustness in Machine Learning
