Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment

Vyas Raina; Adian Liusie; Mark Gales

arXiv:2402.14016 · cs.CL · July 8, 2024 · 2 cites

TL;DR

This paper investigates how vulnerable LLM judges are to adversarial manipulation. It shows that short universal adversarial phrases appended to a text can inflate the scores a judge assigns, particularly under absolute scoring, which raises concerns about the reliability of LLM-based assessment.

Contribution

First analysis of adversarial robustness of judge-LLMs, proposing a surrogate attack method and demonstrating significant score inflation across unseen models.

Findings

01 Universal adversarial phrases can drastically inflate scores.

02 Judge-LLMs are more vulnerable in absolute scoring than in comparative assessment.

03 Scores can be manipulated even when attack phrases are transferred to unknown models.

Abstract

Large Language Models (LLMs) are powerful zero-shot assessors used in real-world situations such as assessing written exams and benchmarking systems. Despite these critical applications, no existing work has analyzed the vulnerability of judge-LLMs to adversarial manipulation. This work presents the first study on the adversarial robustness of assessment LLMs, where we demonstrate that short universal adversarial phrases can be concatenated to deceive judge LLMs to predict inflated scores. Since adversaries may not know or have access to the judge-LLMs, we propose a simple surrogate attack where a surrogate model is first attacked, and the learned attack phrase then transferred to unknown judge-LLMs. We propose a practical algorithm to determine the short universal attack phrases and demonstrate that when transferred to unseen models, scores can be drastically inflated such that…
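The abstract does not spell out the search algorithm, so the following is only a minimal sketch of one plausible way to learn a universal phrase against a surrogate: a greedy word-by-word search that, at each step, appends the vocabulary word that most raises the mean score over all texts. The function `greedy_universal_phrase`, the scorer `toy_surrogate`, and the vocabulary are hypothetical stand-ins for illustration, not the authors' implementation; a real surrogate would be an LLM prompted to output a score.

```python
from typing import Callable, List, Sequence


def greedy_universal_phrase(
    texts: Sequence[str],
    score: Callable[[str], float],
    vocab: Sequence[str],
    phrase_len: int,
) -> List[str]:
    """Greedily grow one phrase that maximises the mean surrogate score.

    The same phrase is appended to every text, which is what makes it
    "universal" rather than tailored to a single input.
    """
    phrase: List[str] = []
    for _ in range(phrase_len):

        def mean_score(word: str) -> float:
            # Score every text with the current phrase plus the candidate word.
            suffix = " ".join(phrase + [word])
            return sum(score(f"{t} {suffix}") for t in texts) / len(texts)

        phrase.append(max(vocab, key=mean_score))
    return phrase


# Hypothetical toy surrogate judge (assumption): sums fixed word weights.
# It stands in for an LLM that would be prompted to return a quality score.
def toy_surrogate(text: str) -> float:
    weights = {"excellent": 2.0, "flawless": 1.5, "summary": 0.5}
    return sum(weights.get(w, 0.0) for w in text.lower().split())


texts = ["the cat sat", "a short summary of events"]
vocab = ["okay", "summary", "flawless", "excellent"]
phrase = greedy_universal_phrase(texts, toy_surrogate, vocab, phrase_len=2)
print(phrase)
```

To mimic the transfer setting described in the abstract, the learned `phrase` would then be appended to new texts and scored by a different judge that was never queried during the search.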

Code & Models

Repositories

rainavyas/attack-comparative-assessment (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning