Lexical Hints of Accuracy in LLM Reasoning Chains
Arne Vanhoyweghen, Brecht Verbeken, Andres Algaba, Vincent Ginis

TL;DR
This paper tests whether lexical cues in Chain-of-Thought (CoT) reasoning predict LLM answer correctness. It finds that uncertainty markers reliably indicate incorrect answers, especially on low-accuracy benchmarks, which can help with model calibration.
Contribution
The paper proposes lexical markers of uncertainty in the reasoning chain as signals of LLM answer correctness, which can strengthen post-hoc calibration methods for safer deployment.
Findings
Lexical uncertainty markers strongly predict incorrect responses.
Sentiment shifts in CoT provide additional, weaker signals.
CoT length predicts correctness only in intermediate-difficulty benchmarks.
Abstract
Fine-tuning Large Language Models (LLMs) with reinforcement learning to produce an explicit Chain-of-Thought (CoT) before answering consistently raises performance on code, math, and general-knowledge benchmarks. However, on benchmarks where LLMs currently achieve low accuracy, such as Humanity's Last Exam (HLE), they often report high self-confidence, reflecting poor calibration. Here, we test whether measurable properties of the CoT provide reliable signals of an LLM's internal confidence in its answers. We analyze three feature classes: (i) CoT length, (ii) intra-CoT sentiment volatility, and (iii) lexical hints, including hedging words. Using DeepSeek-R1 and Claude 3.7 Sonnet on both HLE, a frontier benchmark with very low accuracy, and Omni-MATH, a saturated benchmark of moderate difficulty, we find that lexical markers of…
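
To make two of the feature classes above concrete, here is a minimal sketch of how CoT length and a hedging-word count might be extracted from a single reasoning chain. It is not the authors' pipeline: the function name cot_features, the HEDGE_MARKERS list, and the regex tokenization are all assumptions chosen for illustration, and the paper's actual marker vocabulary may differ. Sentiment volatility is left out because it would require a sentiment model.

```python
import re

# Hypothetical hedging markers, chosen for illustration only.
# The paper's actual vocabulary is not reproduced here.
HEDGE_MARKERS = ("guess", "maybe", "perhaps", "not sure", "unsure")


def cot_features(cot: str) -> dict:
    """Return simple length and hedging features for one reasoning chain."""
    text = cot.lower()
    # Crude word tokenization: runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text)
    # Count whole-word (or whole-phrase) occurrences of each marker.
    hedge_count = sum(
        len(re.findall(r"\b" + re.escape(marker) + r"\b", text))
        for marker in HEDGE_MARKERS
    )
    return {
        "length_tokens": len(tokens),
        "hedge_count": hedge_count,
        # Normalize by length so long chains are not penalized automatically.
        "hedge_rate": hedge_count / len(tokens) if tokens else 0.0,
    }


if __name__ == "__main__":
    print(cot_features("I guess the answer is 4, but I am not sure."))
```

Features like these could then be fed to any standard classifier or correlation test against answer correctness; the normalization by token count is one possible design choice, not something the abstract specifies.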
