AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens
Tung-Ling Li, Yuhao Wu, Hongliang Liu

TL;DR
This paper uncovers a vulnerability in LLM-based judge systems: short sequences of control tokens can flip binary decisions and produce false positives. It proposes adversarial training to mitigate the issue.
Contribution
It introduces AdvJudge-Zero, a method to discover control tokens that manipulate LLM judges, revealing a realistic reward-hacking risk and proposing defenses.
Findings
Control tokens can flip binary judgments from a correct "No" to an incorrect "Yes".
Adversarial training reduces false positives significantly.
Vulnerability affects large open-weight and specialized judge models.
Abstract
Reward models and LLM-as-a-Judge systems are central to modern post-training pipelines such as RLHF, DPO, and RLAIF, where they provide scalar feedback and binary decisions that guide model selection and RL-based fine-tuning. We show that these judge systems exhibit a recurring vulnerability: short sequences of low-perplexity control tokens can flip many binary evaluations from correct "No" judgments to incorrect "Yes" judgments by steering the last-layer logit gap. These control tokens are patterns that a policy model could plausibly generate during post-training, and thus represent realistic reward-hacking risks rather than worst-case adversarial strings. Our method, AdvJudge-Zero, uses the model's next-token distribution and beam-search exploration to discover diverse control-token sequences from scratch, and our analysis shows that the induced hidden-state perturbations…
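The mechanism described in the abstract can be illustrated with a toy sketch. This is not the authors' AdvJudge-Zero implementation: the function names, the tiny vocabulary, the per-token gap shifts, and the log-probabilities below are all hypothetical stand-ins for a real judge model. The sketch only shows the general idea of a binary verdict read from a Yes/No logit gap, and a beam search that keeps likely (low-perplexity) token sequences while looking for ones that push the gap above zero.

```python
from typing import Callable, List, Sequence, Tuple

Seq = Tuple[str, ...]


def judge_decision(z_yes: float, z_no: float) -> str:
    """Binary verdict from two final-layer logits: 'Yes' iff the gap is positive."""
    return "Yes" if (z_yes - z_no) > 0 else "No"


def search_control_tokens(
    vocab: Sequence[str],
    gap_fn: Callable[[Seq], float],      # hypothetical: returns z_yes - z_no for a suffix
    logprob_fn: Callable[[Seq], float],  # hypothetical: total log-probability of a suffix
    beam_width: int = 2,
    max_len: int = 2,
    min_mean_logprob: float = -3.0,
) -> List[Seq]:
    """Beam search for short, likely token sequences that flip 'No' to 'Yes'."""
    beams: List[Seq] = [()]
    flips: List[Seq] = []
    for _ in range(max_len):
        candidates: List[Seq] = []
        for seq in beams:
            for tok in vocab:
                new = seq + (tok,)
                # Keep only sequences the model itself finds likely (low perplexity).
                if logprob_fn(new) / len(new) < min_mean_logprob:
                    continue
                candidates.append(new)
        # Rank by how far each candidate pushes the logit gap toward "Yes".
        candidates.sort(key=gap_fn, reverse=True)
        beams = candidates[:beam_width]
        flips.extend(s for s in beams if gap_fn(s) > 0)
    return flips


# Toy stand-ins for a real judge (illustrative values only, not measured data).
BASE_GAP = -1.25  # the judge starts at a correct "No"
GAP_SHIFT = {"a": 0.5, "b": 1.0, "c": -0.5, "zz": 5.0}
LOGPROB = {"a": -1.0, "b": -2.0, "c": -0.5, "zz": -9.0}


def toy_gap(seq: Seq) -> float:
    return BASE_GAP + sum(GAP_SHIFT[t] for t in seq)


def toy_logprob(seq: Seq) -> float:
    return sum(LOGPROB[t] for t in seq)


flips = search_control_tokens(list(GAP_SHIFT), toy_gap, toy_logprob)
```

In this toy setup the token "zz" would flip the verdict on its own, but the likelihood filter rejects it, which mirrors the abstract's distinction between plausible control tokens and worst-case adversarial strings.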
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI
