Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters

Xingjian Zhang; Tianhong Gao; Suliang Jin; Tianhao Wang; Teng Ye; Eytan Adar; Qiaozhu Mei

arXiv:2510.25860 · cs.AI · February 23, 2026

TL;DR

This paper introduces a human-LLM collaborative framework that infers thinking traces from label-only annotations, improving the reliability of LLM raters and their agreement with human judgments on subjective evaluation tasks.

Contribution

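The paper presents a simple rejection sampling method that reconstructs thinking traces from label-only data. The inferred traces are then used to fine-tune open LLM raters and to synthesize clearer annotation guidelines for proprietary LLM raters.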

Findings

01

Improved LLM-human agreement across multiple datasets

02

Enhanced inter-model agreement with refined guidelines

03

Effective inference of reasoning traces from label-only annotations
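The findings above are stated in terms of agreement between raters. As a rough illustration of how such agreement can be quantified, the sketch below computes Cohen's kappa between two label sequences. The choice of Cohen's kappa is an assumption made here for illustration; this page does not say which agreement statistic the paper reports, and the function name `cohen_kappa` is hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Chance-corrected agreement between two raters over the same items.

    Assumption: Cohen's kappa is used here only as a familiar example of an
    agreement statistic; the paper's actual metric is not given on this page.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty item set")

    n = len(rater_a)
    # Fraction of items on which the two raters give the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters always give the same single label.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

In this framing, LLM-human agreement would compare an LLM rater's labels against human labels, and inter-model agreement would compare two LLM raters' labels on the same items.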

Abstract

Large language models (LLMs) are increasingly used as raters for evaluation tasks. However, their reliability is often limited for subjective tasks, where human judgments involve subtle reasoning beyond annotation labels. Thinking traces, the reasoning behind a judgment, are highly informative but challenging to collect and curate. We present a human-LLM collaborative framework to infer thinking traces from label-only annotations. The proposed framework uses a simple and effective rejection sampling method to reconstruct these traces at scale. These inferred thinking traces are applied to two complementary tasks: (1) fine-tuning open LLM raters; and (2) synthesizing clearer annotation guidelines for proprietary LLM raters. Across multiple datasets, our methods lead to significantly improved LLM-human agreement. Additionally, the refined annotation guidelines increase agreement among…
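The rejection sampling idea in the abstract can be sketched roughly as follows: sample candidate thinking traces for an item, and keep one only if the label it arrives at matches the human annotation. This is a minimal sketch under stated assumptions, not the authors' implementation. The names `infer_trace`, `Sampler`, and `make_toy_sampler` are hypothetical, and the toy sampler merely stands in for a call to an LLM.

```python
import itertools
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical interface: a sampler maps an item to (thinking_trace, predicted_label).
Sampler = Callable[[str], Tuple[str, int]]


def infer_trace(item: str, human_label: int, sample: Sampler,
                max_tries: int = 8) -> Optional[str]:
    """Return the first sampled trace whose label matches the human label.

    Traces that end in a different label are rejected. If no sample within
    max_tries matches, the item gets no inferred trace (None).
    """
    for _ in range(max_tries):
        trace, predicted_label = sample(item)
        if predicted_label == human_label:
            return trace  # accept: the trace is consistent with the annotation
    return None


def make_toy_sampler(labels: Sequence[int] = (1, 2, 3, 4, 5)) -> Sampler:
    """Deterministic stand-in for an LLM call (assumption, for illustration only).

    It cycles through the given labels and emits a dummy trace for each.
    """
    label_iter = itertools.cycle(labels)

    def sample(item: str) -> Tuple[str, int]:
        label = next(label_iter)
        return f"Considering {item!r}, I would rate it {label}.", label

    return sample


if __name__ == "__main__":
    sampler = make_toy_sampler()
    print(infer_trace("essay A", human_label=3, sample=sampler))
```

In a real pipeline the sampler would prompt a model for reasoning followed by a label, and the accepted (item, trace, label) triples would form the data used downstream for fine-tuning or guideline synthesis, as the abstract describes.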
