On the Thinking-Language Modeling Gap in Large Language Models
Chenxi Liu, Yongqiang Chen, Tongliang Liu, James Cheng, Bo Han, Kun Zhang

TL;DR
This paper identifies a gap between language modeling and thought modeling in large language models, and introduces a new prompt technique called Language-of-Thoughts (LoT) to reduce biases and improve reasoning performance.
Contribution
The paper reveals the gap between language and thought modeling in LLMs and proposes the LoT prompt method to mitigate biases and enhance reasoning accuracy.
Findings
LoT reduces language biases in LLMs
LoT improves reasoning task performance
Bias mitigation leads to more accurate thought elicitation
Abstract
System 2 reasoning, which requires slow and logical thinking, is one of the defining characteristics of intelligence. Humans conduct System 2 reasoning via the language of thoughts, which organizes the reasoning process as a causal sequence of mental language, or thoughts. Recently, it has been observed that System 2 reasoning can be elicited from Large Language Models (LLMs) pre-trained on large-scale natural language. However, in this work, we show that there is a significant gap between the modeling of languages and thoughts. As language is primarily a tool for humans to share knowledge and thinking, modeling human language can easily cause LLMs to absorb language biases that deviate from the chain of thoughts in minds. Furthermore, we show that these biases mislead the elicitation of "thoughts" in LLMs, so that it focuses only on a biased part of the premise. To this end, we propose a new prompt…
Peer Reviews
Decision: ICLR 2026 Poster
- The use of SCMs provides a structured, causal lens to analyze LLM reasoning failures. The formulation offers a novel theoretical contribution that explains phenomena including order sensitivity and context overlooking, and could potentially influence and inspire future work.
- The LoT prompt is simple, easy to apply, and model-agnostic, showing consistent improvements across multiple LLMs and tasks. Token cost studies demonstrate that the improvements do not come from increased output length.
- T
- The theoretical assumptions of perfect knowledge and Markov conditions are simplifications that could limit broader applicability.
- Model behaviors are validated with an LLM-as-judge approach, which could raise concerns. Manual verification would strengthen the claims.
- The behavior and performance of LoT in more advanced and popular settings, including self-consistency [1], tree-of-thoughts [2], and ReAct [3], are unclear.
[1] Wang, Xuezhi, et al. "Self-consistency improves chain of thought reasoning in
1) Understanding the biases that LLMs learn from our language structure is a clearly relevant motivation. Moreover, this work tries to formally model how language structure and reasoning interact, opening the way to a better understanding of reasoning behaviour in LLMs.
2) They give us a way to quantify learnt reasoning biases and to modify how LLMs are prompted in order to reduce this effect.
3) A lot of experiments were done on several modern LLMs over reasonably complex datasets. Showing that in
1) The notation is confusing and non-standard. At the beginning, it is hard to parse the use of expression sets and the use of \pi for order (usually \pi is reserved for permutations). Definitions are incomplete and misleading; for example, Definition 2.1 assigns a conditional probability to l_k, but l_k is part of the given sequence.
2) Even though the gap between language-based reasoning and human reasoning is a generalized issue with modern LLMs, the main theorem (2.4) only covers the particular case of two
1. Precise problem setup. The paper defines the phenomenon clearly and ties it to a concrete causal model.
2. Clean factorization. It separates L-implicitness (how things are said) from q-implicitness (what context is needed) and analyzes them independently.
3. Broad evaluation. Results are reported across many datasets and models.
1. Narrow training objective. The analysis focuses on autoregressive next-token prediction and doesn’t discuss masked/bidirectional or fill-in-the-middle training.
2. Missing structured baselines. Methods like self-consistency, Tree-of-Thoughts, or Graph-of-Thoughts aren’t compared under matched budgets.
3. Limited failure analysis. There’s no human study comparing when CoT fails vs. your method succeeds (and the reverse), beyond LLM-as-judge signals.
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Machine Learning in Healthcare
Methods: Focus
