Lexical Hints of Accuracy in LLM Reasoning Chains

Arne Vanhoyweghen; Brecht Verbeken; Andres Algaba; Vincent Ginis

arXiv:2508.15842 · cs.CL · August 25, 2025

TL;DR

This paper tests whether lexical cues in Chain-of-Thought reasoning predict LLM accuracy. It finds that uncertainty markers reliably indicate incorrect answers, especially on low-accuracy benchmarks, which can help with model calibration.

Contribution

The paper introduces lexical markers of uncertainty as signals of LLM answer correctness that post-hoc calibration methods can use for safer deployment.

Findings

01 Lexical uncertainty markers strongly predict incorrect responses.

02 Sentiment shifts in CoT provide additional, weaker signals.

03 CoT length predicts correctness only in intermediate-difficulty benchmarks.

Abstract

Fine-tuning Large Language Models (LLMs) with reinforcement learning to produce an explicit Chain-of-Thought (CoT) before answering yields models that consistently perform better on code, math, and general-knowledge benchmarks. However, on benchmarks where LLMs currently achieve low accuracy, such as Humanity's Last Exam (HLE), they often report high self-confidence, reflecting poor calibration. Here, we test whether measurable properties of the CoT provide reliable signals of an LLM's internal confidence in its answers. We analyze three feature classes: (i) CoT length, (ii) intra-CoT sentiment volatility, and (iii) lexical hints, including hedging words. Using DeepSeek-R1 and Claude 3.7 Sonnet on both HLE, a frontier benchmark with very low accuracy, and Omni-MATH, a saturated benchmark of moderate difficulty, we find that lexical markers of…
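To make two of the three feature classes concrete, the following is a minimal sketch of how CoT length and hedging-word frequency could be extracted from a single reasoning chain. It is not the authors' pipeline: the `HEDGE_TERMS` list, the `cot_features` function, and the regex-based tokenization are all assumptions made for illustration, and the paper's actual lexicon and tokenizer may differ.

```python
import re

# Hypothetical hedging lexicon, chosen for illustration only.
# It is NOT the word list used in the paper.
HEDGE_TERMS = ("maybe", "perhaps", "guess", "not sure", "possibly", "unclear")


def cot_features(cot: str) -> dict:
    """Return simple length and hedging features for one reasoning chain."""
    text = cot.lower()
    # Crude word tokenization: runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text)
    # Count whole-word (or whole-phrase) occurrences of each hedging term.
    hedge_count = sum(
        len(re.findall(r"\b" + re.escape(term) + r"\b", text))
        for term in HEDGE_TERMS
    )
    n = len(tokens)
    return {
        "length_tokens": n,
        "hedge_count": hedge_count,
        # Normalize by length so long chains are not penalized automatically.
        "hedge_rate": hedge_count / n if n else 0.0,
    }


example = "I guess the answer is 12, but I am not sure. Maybe check the parity."
features = cot_features(example)
print(features)
```

Features like these could then feed a simple classifier or threshold rule that flags answers likely to be wrong. Normalizing the hedge count by chain length is one design choice among several; raw counts or presence/absence indicators are equally plausible.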
