AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens

Tung-Ling Li; Yuhao Wu; Hongliang Liu

arXiv:2512.17375·cs.LG·December 22, 2025

TL;DR

This paper shows that short control-token sequences can flip the binary decisions of LLM-based judge systems, producing false positives, and proposes adversarial training to mitigate the issue.

Contribution

The paper introduces AdvJudge-Zero, a method for discovering control tokens that manipulate LLM judges; it reveals a realistic reward-hacking risk and proposes defenses.

Findings

01. Short control-token sequences can flip binary judgments from a correct "No" to an incorrect "Yes".

02. Adversarial training significantly reduces these false positives.

03. The vulnerability affects large open-weight and specialized judge models.

Abstract

Reward models and LLM-as-a-Judge systems are central to modern post-training pipelines such as RLHF, DPO, and RLAIF, where they provide scalar feedback and binary decisions that guide model selection and RL-based fine-tuning. We show that these judge systems exhibit a recurring vulnerability: short sequences of low-perplexity control tokens can flip many binary evaluations from correct "No" judgments to incorrect "Yes" judgments by steering the last-layer logit gap. These control tokens are patterns that a policy model could plausibly generate during post-training, and thus represent realistic reward-hacking risks rather than worst-case adversarial strings. Our method, AdvJudge-Zero, uses the model's next-token distribution and beam-search exploration to discover diverse control-token sequences from scratch, and our analysis shows that the induced hidden-state perturbations…
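
The search idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the vocabulary, the per-token scores, the `BASE_GAP` value, and the helper names (`logit_gap`, `next_token_logprobs`, `search_control_tokens`) are all hypothetical stand-ins for a real judge model. It assumes the judge answers "Yes" when the logit gap z_yes - z_no is positive, restricts each step to the most likely next tokens (a proxy for low perplexity), and keeps the beams with the largest gap.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical toy numbers; a real judge would supply these from its logits.
TOKEN_PUSH = {"\n": 0.2, ":": 0.9, "Answer": 0.6,
              "Final": 0.8, "Solution": 1.1, "###": 0.3}
BASE_GAP = -2.0  # z_yes - z_no < 0: the judge starts at a correct "No"


def logit_gap(suffix: Tuple[str, ...]) -> float:
    """Toy stand-in for the judge's last-layer gap z_yes - z_no."""
    return BASE_GAP + sum(TOKEN_PUSH[tok] for tok in suffix)


def next_token_logprobs(suffix: Tuple[str, ...]) -> Dict[str, float]:
    """Toy stand-in for the model's next-token distribution."""
    scores = {"\n": 1.0, ":": 2.0, "Answer": 1.5,
              "Final": 1.8, "Solution": 1.7, "###": 0.2}
    if suffix:
        scores[suffix[-1]] -= 3.0  # discourage immediate repetition
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return {tok: s - log_z for tok, s in scores.items()}


def search_control_tokens(max_len: int = 3, beam_width: int = 2,
                          top_k: int = 3) -> List[Tuple[str, ...]]:
    """Beam search: expand likely tokens, keep beams with the largest gap."""
    beams: List[Tuple[str, ...]] = [()]
    flips: List[Tuple[str, ...]] = []
    for _ in range(max_len):
        candidates = []
        for suffix in beams:
            logprobs = next_token_logprobs(suffix)
            likely = sorted(logprobs, key=logprobs.get, reverse=True)[:top_k]
            candidates.extend(suffix + (tok,) for tok in likely)
        candidates.sort(key=logit_gap, reverse=True)
        beams = candidates[:beam_width]
        # A positive gap means the toy judge now answers "Yes".
        flips.extend(s for s in beams if logit_gap(s) > 0)
    return flips


if __name__ == "__main__":
    for suffix in search_control_tokens():
        print(suffix, round(logit_gap(suffix), 2))
```

In this toy setting the gap is additive over tokens, which a real judge's logit gap would not be; the sketch only shows how a likelihood-restricted beam search can be scored by the Yes/No gap.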

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI