On the Thinking-Language Modeling Gap in Large Language Models

Chenxi Liu; Yongqiang Chen; Tongliang Liu; James Cheng; Bo Han; Kun Zhang

arXiv:2505.12896·cs.CL·May 20, 2025

Open Access · 3 Reviews

TL;DR

This paper identifies a gap between language modeling and thought modeling in large language models, and introduces a new prompt technique called Language-of-Thoughts (LoT) to reduce biases and improve reasoning performance.

Contribution

The paper reveals the gap between language and thought modeling in LLMs and proposes the LoT prompt method to mitigate biases and enhance reasoning accuracy.

Findings

01

LoT reduces language biases in LLMs

02

LoT improves reasoning task performance

03

Bias mitigation leads to more accurate thought elicitation

Abstract

System 2 reasoning, which requires slow and logical thinking, is one of the defining characteristics of intelligence. Humans conduct System 2 reasoning via the language of thoughts, which organizes the reasoning process as a causal sequence of mental language, or thoughts. Recently, it has been observed that System 2 reasoning can be elicited from Large Language Models (LLMs) pre-trained on large-scale natural language. However, in this work, we show that there is a significant gap between the modeling of languages and thoughts. As language is primarily a tool for humans to share knowledge and thinking, modeling human language can easily introduce into LLMs language biases that deviate from the chain of thoughts in the mind. Furthermore, we show that these biases mislead the eliciting of "thoughts" in LLMs to focus only on a biased part of the premise. To this end, we propose a new prompt…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

- The use of SCMs provides a structured, causal lens to analyze LLM reasoning failures. The formulation offers a novel theoretical contribution that explains phenomena including order sensitivity and context overlooking, and could potentially influence and inspire future work.
- LoT prompt is simple, easy to apply, and model-agnostic, showing consistent improvements across multiple LLMs and tasks. Token cost studies demonstrate that the improvements do not come from increased output length.
- T

Weaknesses

- Theoretical assumptions of perfect knowledge and Markov conditions are simplified and could limit broader applicability.
- Model behaviors are validated by the LLM-as-judge approach, which could raise concerns. Manual verification would strengthen claims.
- The behaviors and performances of LoT in more advanced and popular settings are unclear, including self-consistency [1], tree-of-thoughts [2], and ReAct [3].

[1] Wang, Xuezhi, et al. "Self-consistency improves chain of thought reasoning in

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1) Understanding the biases LLMs learn from the structure of our language is a clearly relevant motivation. Moreover, this work tries to formally model how language structure and reasoning interact, paving the way for a better understanding of reasoning behaviour in LLMs.
2) They give us a way to quantify learnt reasoning biases and to modify how LLMs are prompted in order to reduce this effect.
3) Many experiments were done across several modern LLMs on reasonably complex datasets. Showing that in

Weaknesses

1) Notation is confusing and non-standard. At the beginning it is hard to parse the use of expression sets and the use of \pi for order (usually \pi is reserved for permutations). Definitions are incomplete and misleading; for example, definition 2.1 assigns a conditional probability to l_k, but l_k is part of the given sequence.
2) Even though the gap between language-based reasoning and human reasoning is a generalized issue with modern LLMs, the main theorem (2.4) covers only the particular case of two

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. Precise problem setup. The paper defines the phenomenon clearly and ties it to a concrete causal model.
2. Clean factorization. It separates L-implicitness (how things are said) from q-implicitness (what context is needed) and analyzes them independently.
3. Broad evaluation. Results are reported across many datasets and models.

Weaknesses

1. Narrow training objective. The analysis focuses on autoregressive next-token prediction and doesn't discuss masked/bidirectional or fill-in-the-middle training.
2. Missing structured baselines. Methods like self-consistency, Tree-of-Thoughts, or Graph-of-Thoughts aren't compared under matched budgets.
3. Limited failure analysis. There's no human study comparing when CoT fails vs. your method succeeds (and the reverse), beyond LLM-as-judge signals.

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Machine Learning in Healthcare

Methods: Focus