Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges

Ruomeng Ding; Yifei Pang; He Sun; Yizhong Wang; Zhiwei Steven Wu; Zhun Deng

arXiv:2602.13576·cs.CR·February 17, 2026

TL;DR

This paper uncovers a vulnerability in LLM-based evaluation systems where natural-language rubrics can be subtly manipulated to systematically bias judgments, leading to persistent model misalignment and reduced evaluation accuracy.

Contribution

The paper introduces the concept of Rubric-Induced Preference Drift (RIPD) and demonstrates how rubric edits can systematically bias LLM judgments and propagate through alignment pipelines.

Findings

01. Rubric edits can cause systematic preference shifts without detection.

02. Rubric-based attacks can reduce judgment accuracy by up to 27.9%.

03. Bias propagates into trained models, causing persistent drift.

Abstract

Evaluation and alignment pipelines for large language models increasingly rely on LLM-based judges, whose behavior is guided by natural-language rubrics and validated on benchmarks. We identify a previously under-recognized vulnerability in this workflow, which we term Rubric-Induced Preference Drift (RIPD). Even when rubric edits pass benchmark validation, they can still produce systematic and directional shifts in a judge's preferences on target domains. Because rubrics serve as a high-level decision interface, such drift can emerge from seemingly natural, criterion-preserving edits and remain difficult to detect through aggregate benchmark metrics or limited spot-checking. We further show this vulnerability can be exploited through rubric-based preference attacks, in which benchmark-compliant rubric edits steer judgments away from a fixed human or trusted reference on target domains,…
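The gap the abstract describes, where an edited rubric passes benchmark validation yet shifts verdicts on a target domain, can be illustrated with a small comparison of agreement rates. The following is a minimal sketch under stated assumptions, not the paper's method: the function names `agreement` and `drift_report`, the `tolerance` threshold, and all toy verdicts are hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence


def agreement(judge: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of pairwise comparisons where the judge picks the same
    response ("A" or "B") as the fixed reference labels."""
    if len(judge) != len(reference) or not judge:
        raise ValueError("verdict lists must be non-empty and equal in length")
    return sum(j == r for j, r in zip(judge, reference)) / len(judge)


def drift_report(
    base_bench: Sequence[str],
    edited_bench: Sequence[str],
    ref_bench: Sequence[str],
    base_target: Sequence[str],
    edited_target: Sequence[str],
    ref_target: Sequence[str],
    tolerance: float = 0.02,  # hypothetical validation threshold
) -> Dict[str, object]:
    """Compare a baseline rubric and an edited rubric on two splits:
    the validation benchmark and a target domain."""
    bench_delta = agreement(edited_bench, ref_bench) - agreement(base_bench, ref_bench)
    target_delta = agreement(edited_target, ref_target) - agreement(base_target, ref_target)
    return {
        "bench_delta": bench_delta,
        "target_delta": target_delta,
        # The edit "passes" if benchmark agreement does not drop beyond tolerance.
        "passes_benchmark": bench_delta >= -tolerance,
    }


# Toy verdicts (hypothetical): the edited rubric matches the baseline on the
# benchmark in aggregate, but flips several target-domain verdicts toward "B".
ref_bench = list("ABABABABAB")
base_bench = list("ABABABABAA")
edited_bench = list("BBABABABAB")

ref_target = list("AABBABAB")
base_target = list("AABBABAA")
edited_target = list("BBBBABBB")

report = drift_report(
    base_bench, edited_bench, ref_bench,
    base_target, edited_target, ref_target,
)
print(report)
```

In this toy setup the benchmark agreement is unchanged, so the edit would pass an aggregate check, while agreement with the reference on the target split drops. Reporting the two splits separately, rather than one pooled score, is what makes the shift visible.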

Code & Models

Datasets

ZDCSlab/ripd-dataset
dataset · 102 dl

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Ethics and Social Impacts of AI