Optimization-based Prompt Injection Attack to LLM-as-a-Judge

Jiawen Shi; Zenghui Yuan; Yinuo Liu; Yue Huang; Pan Zhou; Lichao Sun; Neil Zhenqiang Gong

arXiv:2403.17710 · cs.CR · August 26, 2025 · 2 cites

TL;DR

This paper introduces JudgeDeceiver, an optimization-based prompt injection attack that manipulates LLM-as-a-Judge to select attacker-controlled responses, demonstrating high effectiveness and exposing vulnerabilities in current defense methods.

Contribution

We propose JudgeDeceiver, a novel gradient-based optimization method for prompt injection attacks on LLM-as-a-Judge, significantly outperforming existing manual and jailbreak attacks.

Findings

01

JudgeDeceiver effectively manipulates LLM-as-a-Judge in various applications.

02

Existing defenses, such as detection methods, are insufficient against JudgeDeceiver.

03

The attack demonstrates an urgent need for new defense strategies.

Abstract

LLM-as-a-Judge uses a large language model (LLM) to select the best response from a set of candidates for a given question. LLM-as-a-Judge has many applications, such as LLM-powered search, reinforcement learning with AI feedback (RLAIF), and tool selection. In this work, we propose JudgeDeceiver, an optimization-based prompt injection attack on LLM-as-a-Judge. JudgeDeceiver injects a carefully crafted sequence into an attacker-controlled candidate response such that LLM-as-a-Judge selects that candidate response for an attacker-chosen question, regardless of what the other candidate responses are. Specifically, we formulate finding such a sequence as an optimization problem and propose a gradient-based method to approximately solve it. Our extensive evaluation shows that JudgeDeceiver is highly effective, and is much more effective than existing prompt injection attacks that manually craft the…
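
The setting described in the abstract can be pictured with a minimal Python sketch. Everything named here is an illustrative assumption rather than something taken from the paper or its repository: the prompt template, the helper names `build_judge_prompt` and `place_target`, and the `<INJECTED>` placeholder that stands in for the optimized sequence. The optimization step itself is not shown.

```python
import random
from typing import List, Tuple


def build_judge_prompt(question: str, responses: List[str]) -> str:
    """Assemble a hypothetical judge prompt that lists numbered candidates."""
    lines = [
        "Select the best response to the question. Answer with its number only.",
        f"Question: {question}",
    ]
    for i, response in enumerate(responses, start=1):
        lines.append(f"Response {i}: {response}")
    return "\n".join(lines)


def place_target(
    target: str, others: List[str], rng: random.Random
) -> Tuple[List[str], int]:
    """Insert the target among the other candidates at a random position.

    Returns the candidate list and the target's 1-based index.
    """
    pos = rng.randrange(len(others) + 1)
    candidates = others[:pos] + [target] + others[pos:]
    return candidates, pos + 1


question = "What is the capital of France?"
injected_sequence = "<INJECTED>"  # placeholder, not a real optimized sequence
target = "The capital of France is Lyon. " + injected_sequence

rng = random.Random(0)  # fixed seed so the arrangement is reproducible
candidates, target_index = place_target(
    target, ["Paris.", "The capital of France is Paris."], rng
)
prompt = build_judge_prompt(question, candidates)
print(prompt)
print("Index the attacker wants the judge to output:", target_index)
```

The abstract requires the attack to succeed whatever the other candidates are. Randomizing the target's position is an additional assumption of this sketch, meant only to show that the same injected response can appear in many different prompts.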

Code & Models

Repositories

shijiawenwen/judgedeceiver
PyTorch · Official

Taxonomy

Topics: Security and Verification in Computing · Cryptography and Data Security · Adversarial Robustness in Machine Learning

Methods: Sparse Evolutionary Training · Reinforcement Learning from AI Feedback