Rethinking Thinking Tokens: Understanding Why They Underperform in Practice
Sreeram Vennam, David Valente, David Herel, Ponnurangam Kumaraguru

TL;DR
This paper investigates why Thinking Tokens (TTs) underperform Chain-of-Thought (CoT) reasoning in language models. It attributes the gap to reliance on a single embedding, which produces inconsistent learning signals and noisy gradients, and provides empirical analysis to support this.
Contribution
It offers a detailed empirical analysis explaining the underperformance of Thinking Tokens and discusses implications for future unsupervised reasoning methods in LLMs.
Findings
Thinking Tokens marginally improve performance but underperform CoT.
Single embedding reliance causes inconsistent learning signals.
Noisy gradients hinder effective reasoning in TTs.
Abstract
Thinking Tokens (TTs) have been proposed as an unsupervised method to facilitate reasoning in language models. However, despite their conceptual appeal, our findings show that TTs only marginally improve performance and consistently underperform Chain-of-Thought (CoT) reasoning across multiple benchmarks. We hypothesize that this underperformance stems from the reliance on a single embedding for TTs, which results in inconsistent learning signals and introduces noisy gradients. This paper provides a comprehensive empirical analysis to validate this hypothesis and discusses the implications for future research on unsupervised reasoning in LLMs.
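The single-embedding hypothesis can be illustrated with a toy sketch. It is not the authors' experiment: the quadratic loss, the two-dimensional targets, and all function names below are assumptions chosen only to show the mechanism. One embedding vector is shared by every context in which the token appears, so its gradient is the sum of the per-context gradients. When contexts pull the vector in different directions, those gradients disagree and largely cancel.

```python
import math

def grad_per_context(embedding, target):
    # Gradient of the assumed toy loss 0.5 * ||embedding - target||^2
    # with respect to the shared embedding.
    return [e - t for e, t in zip(embedding, target)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mean_pairwise_cosine(grads):
    # Average agreement between gradients coming from different contexts.
    pairs = [(i, j) for i in range(len(grads)) for j in range(i + 1, len(grads))]
    return sum(cosine(grads[i], grads[j]) for i, j in pairs) / len(pairs)

def summed(grads):
    # The shared embedding receives the sum of all per-context gradients.
    return [sum(component) for component in zip(*grads)]

shared_embedding = [0.0, 0.0]

# Hypothetical targets: each context "wants" the embedding somewhere else.
conflicting_targets = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
# Hypothetical targets that roughly agree, for comparison.
aligned_targets = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [1.0, -0.1]]

conflicting_grads = [grad_per_context(shared_embedding, t) for t in conflicting_targets]
aligned_grads = [grad_per_context(shared_embedding, t) for t in aligned_targets]

print("conflicting: agreement =", mean_pairwise_cosine(conflicting_grads),
      "summed gradient =", summed(conflicting_grads))
print("aligned:     agreement =", mean_pairwise_cosine(aligned_grads),
      "summed gradient =", summed(aligned_grads))
```

In this toy setup the conflicting contexts give a negative mean pairwise cosine and a summed gradient that cancels out, while the aligned contexts give a cosine near 1 and a clear update direction. Whether real TT training behaves this way is the question the paper's empirical analysis addresses.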
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Education and Critical Thinking Development
