Decoupling the Effect of Chain-of-Thought Reasoning: A Human Label Variation Perspective
Beiduo Chen, Tiancheng Hu, Caiqi Zhang, Robert Litschko, Anna Korhonen, Barbara Plank

TL;DR
This paper examines how chain-of-thought prompting affects large language models' ability to model human label variation. It reveals a decoupled mechanism in which the reasoning content determines final accuracy while the model's priors govern distributional ranking.
Contribution
It systematically disentangles the effects of reasoning text and model priors, showing that CoT improves accuracy but does not calibrate distributional ambiguity.
Findings
CoT content accounts for 99% of accuracy variance.
Model priors explain over 80% of distributional ranking.
Long CoT acts as a decisive decision-maker rather than a calibration tool.
Abstract
Reasoning-tuned LLMs utilizing long Chain-of-Thought (CoT) excel at single-answer tasks, yet their ability to model Human Label Variation, which requires capturing probabilistic ambiguity rather than resolving it, remains underexplored. We investigate this through systematic disentanglement experiments on distribution-based tasks, employing Cross-CoT experiments to isolate the effect of reasoning text from intrinsic model priors. We observe a distinct "decoupled mechanism": while CoT improves distributional alignment, final accuracy is dictated by CoT content (99% variance contribution), whereas distributional ranking is governed by model priors (over 80%). Step-wise analysis further shows that while CoT's influence on accuracy grows monotonically during the reasoning process, distributional structure is largely determined by the LLM's intrinsic priors. These findings suggest that long CoT…
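The "variance contribution" figures can be pictured with a generic two-way sum-of-squares decomposition over a Cross-CoT grid. This is only a minimal sketch: the excerpt does not say how the authors compute their percentages, so the function name `variance_shares`, the grid layout (rows are the model that wrote the CoT, columns are the model that reads it and answers), and the toy numbers are all assumptions made for illustration.

```python
def variance_shares(scores):
    """Split the variance of a metric over a Cross-CoT grid into shares.

    Assumed layout (hypothetical, not taken from the paper):
    scores[i][j] is the metric obtained when the CoT text written by
    source model i is fed to answering model j, whose priors then act.
    """
    n_rows = len(scores)
    n_cols = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_rows * n_cols)

    # Mean per CoT source (rows) and per answering model (columns).
    row_means = [sum(row) / n_cols for row in scores]
    col_means = [
        sum(scores[i][j] for i in range(n_rows)) / n_rows
        for j in range(n_cols)
    ]

    # Sums of squares for a two-way layout without replication.
    ss_cot = n_cols * sum((m - grand) ** 2 for m in row_means)
    ss_prior = n_rows * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum(
        (scores[i][j] - grand) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )
    if ss_total == 0:
        raise ValueError("metric is constant; shares are undefined")
    ss_resid = ss_total - ss_cot - ss_prior

    return {
        "cot": ss_cot / ss_total,
        "prior": ss_prior / ss_total,
        "residual": ss_resid / ss_total,
    }


# Toy grid (made-up numbers): the metric changes mostly with the CoT source.
toy = [[1.0, 2.0],
       [3.0, 4.0]]
shares = variance_shares(toy)
print(shares)
```

Under this reading, a metric whose "cot" share is close to 1 is driven by the reasoning text, while a metric whose "prior" share dominates is driven by whichever model produces the final distribution.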
