Optimization-based Prompt Injection Attack to LLM-as-a-Judge
Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, Neil Zhenqiang Gong

TL;DR
This paper introduces JudgeDeceiver, an optimization-based prompt injection attack that manipulates LLM-as-a-Judge to select attacker-controlled responses, demonstrating high effectiveness and exposing vulnerabilities in current defense methods.
Contribution
We propose JudgeDeceiver, a novel gradient-based optimization method for prompt injection attacks on LLM-as-a-Judge, significantly outperforming existing manual and jailbreak attacks.
Findings
JudgeDeceiver effectively manipulates LLM-as-a-Judge in various applications.
Existing defenses, such as detection methods, are insufficient against JudgeDeceiver.
The attack demonstrates an urgent need for new defense strategies.
Abstract
LLM-as-a-Judge uses a large language model (LLM) to select the best response from a set of candidates for a given question. LLM-as-a-Judge has many applications such as LLM-powered search, reinforcement learning with AI feedback (RLAIF), and tool selection. In this work, we propose JudgeDeceiver, an optimization-based prompt injection attack to LLM-as-a-Judge. JudgeDeceiver injects a carefully crafted sequence into an attacker-controlled candidate response such that LLM-as-a-Judge selects the candidate response for an attacker-chosen question no matter what the other candidate responses are. Specifically, we formulate finding such a sequence as an optimization problem and propose a gradient-based method to approximately solve it. Our extensive evaluation shows that JudgeDeceiver is highly effective, and is much more effective than existing prompt injection attacks that manually craft the…
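The optimization framing in the abstract can be illustrated with a deliberately tiny toy. The sketch below is not the paper's algorithm and does not involve an LLM: the "judge" is a hand-written linear scoring rule, and every name in it (`TOKEN_WEIGHT`, `score`, `judge`, `loss`, `optimize_injection`, `shadow_sets`) is hypothetical. It assumes the attacker optimizes the injected sequence against several sets of stand-in competitor responses, so that the target wins regardless of which competitors appear, and it uses a hinge loss chosen purely for illustration.

```python
# Toy illustration only: the "judge" is a hand-written scoring rule, not an LLM.
VOCAB_SIZE = 10
# Hypothetical per-token preference weights: token t contributes (t - 5) / 10.
TOKEN_WEIGHT = [(t - 5) / 10 for t in range(VOCAB_SIZE)]


def score(tokens):
    """Toy judge score: mean preference weight of the response's tokens."""
    return sum(TOKEN_WEIGHT[t] for t in tokens) / len(tokens)


def judge(candidates):
    """Return the index of the highest-scoring candidate response."""
    return max(range(len(candidates)), key=lambda i: score(candidates[i]))


def loss(target, injected, shadow_sets):
    """Hinge loss: how far the best competitor in each shadow set beats the target."""
    s = score(target + injected)
    return sum(max(0.0, max(score(c) for c in shadow) - s) for shadow in shadow_sets)


def optimize_injection(target, shadow_sets, length=6, sweeps=2, top_k=3):
    """Coordinate-wise token substitution guided by the score gradient."""
    injected = [0] * length
    # In this linear toy, the gradient of the score with respect to the one-hot
    # token choice at any position is proportional to TOKEN_WEIGHT, so the
    # top-k substitution candidates are simply the k highest-weight tokens.
    ranked = sorted(range(VOCAB_SIZE), key=lambda t: TOKEN_WEIGHT[t], reverse=True)[:top_k]
    for _ in range(sweeps):
        for pos in range(length):
            def with_token(t):
                return injected[:pos] + [t] + injected[pos + 1:]
            # Keep the current token as an option so the loss never increases.
            options = ranked + [injected[pos]]
            best = min(options, key=lambda t: loss(target, with_token(t), shadow_sets))
            injected = with_token(best)
    return injected


# Assumed toy data: a weak target response and two shadow sets of competitors.
target = [0, 1, 2]
shadow_sets = [[[5, 6, 7], [4, 4, 4]], [[8, 8, 2], [3, 6, 6]]]
injected = optimize_injection(target, shadow_sets)
for shadow in shadow_sets:
    before = judge(shadow + [target])
    after = judge(shadow + [target + injected])
    print(f"selected index before: {before}, after: {after}")
```

In a real attack the gradient would come from backpropagating through the judge model rather than from a fixed weight table, and candidate substitutions would have to be re-scored with forward passes. The toy only shows the shape of the loop: rank substitutions by gradient, evaluate them against all shadow sets, and keep the best.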
Taxonomy
Topics: Security and Verification in Computing · Cryptography and Data Security · Adversarial Robustness in Machine Learning
Methods: Sparse Evolutionary Training · Reinforcement Learning from AI Feedback
