Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges
Ruomeng Ding, Yifei Pang, He Sun, Yizhong Wang, Zhiwei Steven Wu, Zhun Deng

TL;DR
This paper uncovers a vulnerability in LLM-based evaluation systems where natural-language rubrics can be subtly manipulated to systematically bias judgments, leading to persistent model misalignment and reduced evaluation accuracy.
Contribution
It introduces the concept of Rubric-Induced Preference Drift (RIPD), demonstrating how rubric edits can systematically bias LLM judgments and propagate through alignment pipelines.
Findings
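Rubric edits that pass benchmark validation can still cause systematic, directional preference shifts on target domains, and these shifts are hard to detect through aggregate benchmark metrics or limited spot-checking.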
Rubric-based attacks can reduce judgment accuracy by up to 27.9%.
Bias propagates into trained models, causing persistent drift.
Abstract
Evaluation and alignment pipelines for large language models increasingly rely on LLM-based judges, whose behavior is guided by natural-language rubrics and validated on benchmarks. We identify a previously under-recognized vulnerability in this workflow, which we term Rubric-Induced Preference Drift (RIPD). Even when rubric edits pass benchmark validation, they can still produce systematic and directional shifts in a judge's preferences on target domains. Because rubrics serve as a high-level decision interface, such drift can emerge from seemingly natural, criterion-preserving edits and remain difficult to detect through aggregate benchmark metrics or limited spot-checking. We further show this vulnerability can be exploited through rubric-based preference attacks, in which benchmark-compliant rubric edits steer judgments away from a fixed human or trusted reference on target domains,…
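The gap the abstract describes, between what benchmark validation sees and what happens on a target domain, can be illustrated with a small sketch. This is not the authors' evaluation code. The function names (`accuracy`, `drift_report`, `passes_validation`), the `tolerance` threshold, and the toy "A"/"B" preference labels are all hypothetical, chosen only to show how an edited rubric could match the seed rubric on a benchmark while moving away from the reference on a target domain.

```python
from dataclasses import dataclass


@dataclass
class DriftReport:
    benchmark_acc_seed: float
    benchmark_acc_edited: float
    target_acc_seed: float
    target_acc_edited: float
    # Target items where the seed-rubric judge agreed with the reference
    # but the edited-rubric judge did not.
    flips_away_from_reference: int


def accuracy(judgments, reference):
    """Fraction of pairwise judgments that match the reference labels."""
    if len(judgments) != len(reference) or not reference:
        raise ValueError("judgments and reference must be non-empty and equal length")
    return sum(j == r for j, r in zip(judgments, reference)) / len(reference)


def drift_report(bench_ref, bench_seed, bench_edited,
                 target_ref, target_seed, target_edited):
    """Compare a seed rubric and an edited rubric on benchmark and target data."""
    flips = sum(
        s == r and e != r
        for r, s, e in zip(target_ref, target_seed, target_edited)
    )
    return DriftReport(
        benchmark_acc_seed=accuracy(bench_seed, bench_ref),
        benchmark_acc_edited=accuracy(bench_edited, bench_ref),
        target_acc_seed=accuracy(target_seed, target_ref),
        target_acc_edited=accuracy(target_edited, target_ref),
        flips_away_from_reference=flips,
    )


def passes_validation(report, tolerance=0.02):
    """Hypothetical benchmark gate: edited rubric must not lose more than
    `tolerance` accuracy relative to the seed rubric on the benchmark."""
    return report.benchmark_acc_edited >= report.benchmark_acc_seed - tolerance


# Toy labels (assumed, not from the paper): "A" or "B" is the preferred response.
bench_ref = ["A", "B", "A", "B", "A"]
bench_seed = ["A", "B", "A", "B", "B"]
bench_edited = ["A", "B", "A", "A", "A"]

target_ref = ["A", "A", "B", "B"]
target_seed = ["A", "A", "B", "A"]
target_edited = ["B", "B", "B", "A"]

report = drift_report(bench_ref, bench_seed, bench_edited,
                      target_ref, target_seed, target_edited)
print(report)
print("passes benchmark validation:", passes_validation(report))
```

In this toy data the edited rubric passes the benchmark gate with unchanged benchmark accuracy, while its target-domain agreement with the reference drops. A per-domain comparison of this kind is one way such drift might be surfaced; the abstract notes that aggregate metrics alone tend to miss it.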
Code & Models
- 🤗 ZDCSlab/ripd-ultra-real-gemma2-2b-it-seed-btmodel (2 downloads)
- 🤗 ZDCSlab/ripd-ultra-real-gemma2-2b-it-biased-btmodel (4 downloads)
- 🤗 ZDCSlab/ripd-ultra-real-llama3-8b-instruct-seed-btmodel (8 downloads)
- 🤗 ZDCSlab/ripd-ultra-real-llama3-8b-instruct-biased-btmodel (5 downloads)
- 🤗 ZDCSlab/ripd-anthropic-saferlhf-gemma-2b-uncensored-v1-seed-btmodel (42 downloads)
- 🤗 ZDCSlab/ripd-anthropic-saferlhf-gemma-2b-uncensored-v1-biased-btmodel (39 downloads)
- 🤗 ZDCSlab/ripd-anthropic-saferlhf-dolphin3-llama31-8b-seed-btmodel (8 downloads)
- 🤗 ZDCSlab/ripd-anthropic-saferlhf-dolphin3-llama31-8b-biased-btmodel (6 downloads)
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Ethics and Social Impacts of AI
