Evaluative Fingerprints: Stable and Systematic Differences in LLM Evaluator Behavior

Wajid Nasser

arXiv:2601.05114·cs.AI·January 9, 2026

Open Access

TL;DR

This study finds that LLM-based evaluators are each self-consistent but differ systematically from one another. Their scoring patterns act as stable 'fingerprints', so judges are not interchangeable, which challenges the assumption of a shared evaluation standard.

Contribution

The paper shows that LLM evaluators have stable, judge-specific evaluation patterns. It introduces the term 'evaluative fingerprints' for these patterns and argues that judges are not interchangeable.

Findings

01

Inter-judge agreement is near-zero (Krippendorff's α = 0.042).

02

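A classifier identifies which judge produced an evaluation with 77.1% accuracy from rubric scores alone and 89.9% with disposition features; within a model family, GPT-4.1 and GPT-5.2 are distinguishable with 99.6% accuracy.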

03

Judges' disagreement is structured and consistent, not random noise.
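To make the idea of identifying a judge from its scores concrete, here is a minimal sketch. It is not the paper's classifier, which this page does not describe; it swaps in a plain nearest-centroid rule. The judge names, the three rubric dimensions, and every score are invented for illustration.

```python
from collections import defaultdict


def fit_centroids(samples):
    """Average the score vectors of each judge.

    samples: list of (judge_name, score_vector) pairs (hypothetical data).
    Returns a dict mapping judge_name to its mean score vector.
    """
    grouped = defaultdict(list)
    for judge, vec in samples:
        grouped[judge].append(vec)
    return {
        judge: [sum(col) / len(vecs) for col in zip(*vecs)]
        for judge, vecs in grouped.items()
    }


def predict_judge(centroids, vec):
    """Return the judge whose centroid is closest to vec (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda judge: sq_dist(centroids[judge], vec))


# Invented rubric scores on three dimensions; each judge has its own habitual profile.
train = [
    ("judge_a", [4, 2, 5]), ("judge_a", [4, 3, 5]),
    ("judge_b", [2, 4, 3]), ("judge_b", [2, 5, 3]),
]
centroids = fit_centroids(train)
predicted = predict_judge(centroids, [4, 2, 4])
print(predicted)  # judge_a
```

If judges scored interchangeably, their centroids would sit close together and a rule like this would do no better than chance. Accuracy well above chance is what indicates a stable, judge-specific pattern.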

Abstract

LLM-as-judge systems promise scalable, consistent evaluation. We find the opposite: judges are consistent, but not with each other; they are consistent with themselves. Across 3,240 evaluations (9 judges × 120 unique video × pack items × 3 independent runs), inter-judge agreement is near-zero (Krippendorff's α = 0.042). On two dimensions, judges disagree more than random noise would predict (α < 0). Yet this disagreement isn't chaos; it's structured. A classifier identifies which judge produced an evaluation with 77.1% accuracy from rubric scores alone, rising to 89.9% with disposition features. Within model families, the signal is even stronger: GPT-4.1 and GPT-5.2 are distinguishable with 99.6% accuracy. We call this the reliability paradox: judges cannot agree on what constitutes quality, yet their disagreement patterns are so stable they function as fingerprints. Each…
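For readers unfamiliar with the agreement statistic, the following is a minimal sketch of Krippendorff's α, assuming interval-scaled scores and no missing ratings. The page does not state which distance metric the paper uses, so the squared-difference metric here is an assumption, and the toy ratings are invented. The statistic is α = 1 − D_o / D_e, where D_o is the observed disagreement among scores given to the same item and D_e is the disagreement expected if scores were paired at random.

```python
from itertools import combinations


def krippendorff_alpha_interval(ratings):
    """Krippendorff's alpha with the interval (squared-difference) metric.

    ratings: list of items, each a list of scores from different judges.
    Assumes no missing values; items with fewer than two scores are skipped.
    """
    units = [u for u in ratings if len(u) >= 2]
    n = sum(len(u) for u in units)  # total number of pairable scores

    # Observed disagreement: squared differences within each item.
    d_o = 0.0
    for u in units:
        pair_sum = sum((a - b) ** 2 for a, b in combinations(u, 2))
        d_o += 2 * pair_sum / (len(u) - 1)
    d_o /= n

    # Expected disagreement: squared differences across all scores pooled.
    values = [v for u in units for v in u]
    d_e = 2 * sum((a - b) ** 2 for a, b in combinations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("all scores are identical; alpha is undefined")
    return 1.0 - d_o / d_e


# Invented toy data: three judges who agree exactly on every item.
agree = krippendorff_alpha_interval([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
# Invented toy data: two judges who rank the two items in opposite order.
oppose = krippendorff_alpha_interval([[1, 2], [2, 1]])
print(agree)   # 1.0
print(oppose)  # approximately -0.5
```

The second case shows how α can fall below zero: when judges disagree systematically, scores on the same item differ more than randomly paired scores would.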

Taxonomy

Topics: Imbalanced Data Classification Techniques · Computational and Text Analysis Methods · Explainable Artificial Intelligence (XAI)