Can Reasoning Help Large Language Models Capture Human Annotator Disagreement?

Jingwei Ni; Yu Fan; Vilém Zouhar; Donya Rooein; Alexander Hoyle; Mrinmaya Sachan; Markus Leippold; Dirk Hovy; Elliott Ash

arXiv:2506.19467·cs.CL·January 13, 2026

TL;DR

This paper investigates whether reasoning in large language models helps capture human annotator disagreement. It finds that RLVR-style reasoning degrades disagreement modeling, while naive Chain-of-Thought reasoning improves it for RLHF LLMs.

Contribution

The study systematically evaluates reasoning settings in LLMs for disagreement modeling across model sizes, distribution expression methods, and steering methods (60 experimental setups across 3 tasks). It shows that naive Chain-of-Thought reasoning can improve performance, while RLVR-style reasoning degrades it.

Findings

01. RLVR-style reasoning degrades disagreement modeling performance.

02. Naive Chain-of-Thought reasoning improves RLHF LLMs' disagreement modeling.

03. Replacing human annotators with reasoning LLMs can be risky when disagreements matter.

Abstract

Variation in human annotation (i.e., disagreements) is common in NLP, often reflecting important information like task subjectivity and sample ambiguity. Modeling this variation is important for applications that are sensitive to such information. Although RLVR-style reasoning (Reinforcement Learning with Verifiable Rewards) has improved Large Language Model (LLM) performance on many tasks, it remains unclear whether such reasoning enables LLMs to capture informative variation in human annotation. In this work, we evaluate the influence of different reasoning settings on LLM disagreement modeling. We systematically evaluate each reasoning setting across model sizes, distribution expression methods, and steering methods, resulting in 60 experimental setups across 3 tasks. Surprisingly, our results show that RLVR-style reasoning degrades performance in disagreement modeling, while naive…
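As an illustration of what "disagreement modeling" involves, the sketch below turns a set of annotator votes into an empirical label distribution and measures how far a model's predicted distribution is from it. This is only a minimal example under stated assumptions: the labels, votes, and model probabilities are invented, the helper names are hypothetical, and total variation distance is chosen for simplicity. The metrics the paper actually uses are not stated in this excerpt.

```python
from collections import Counter


def label_distribution(annotations, labels):
    """Empirical distribution of annotator votes over a fixed label set."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    counts = Counter(annotations)
    total = len(annotations)
    return {label: counts[label] / total for label in labels}


def total_variation_distance(p, q):
    """Half the L1 distance between two distributions over the same labels."""
    return 0.5 * sum(abs(p[label] - q[label]) for label in p)


# Hypothetical data: four annotators label one sample.
labels = ["offensive", "not_offensive"]
human_votes = ["offensive", "offensive", "not_offensive", "offensive"]
human_dist = label_distribution(human_votes, labels)

# Hypothetical model output: a predicted distribution over the same labels.
model_dist = {"offensive": 0.5, "not_offensive": 0.5}

# 0.0 means the model matches the human distribution exactly; 1.0 is maximal mismatch.
distance = total_variation_distance(human_dist, model_dist)
print(human_dist, distance)
```

A model that always predicts the majority label with full confidence would score a distance of 0.25 on this sample, the same as the uniform prediction above. Collapsing a distribution to a single label discards the disagreement signal, and a distribution-level metric is one way to measure that loss.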

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Topic Modeling · Explainable Artificial Intelligence (XAI)