TL;DR
This paper explores how chain-of-thought supervision influences transformer learning dynamics, revealing that it accelerates generalization but is limited by task complexity, and introduces a kinetic model to understand these effects.
Contribution
It introduces a kinetic modeling framework for transformer learning, characterizes the dynamic emergence of trace faithfulness, and analyzes how CoT mechanistically alters internal computations.
Findings
CoT accelerates generalization but struggles with complex tasks
A kinetic model quantifies learning speed and shape variations
Trace faithfulness is a dynamic property that develops during training
Abstract
Chain-of-thought (CoT) supervision can substantially improve transformer performance, yet the mechanisms by which models learn to follow and benefit from CoT remain poorly understood. We investigate these learning dynamics through the lens of grokking by pretraining transformers on symbolic reasoning tasks with tunable algorithmic complexity and controllable data composition to study their generalization. Models were trained under two settings: (i) producing only final answers, and (ii) emitting explicit CoT traces before answering. Our results show that while CoT generally improves task performance, its benefits depend on task complexity. To quantify these effects, we model accuracy as a function of the logarithm of training steps with a three-parameter logistic curve, revealing how the learning speed and shape vary with task complexity, data distribution, and the presence of CoT supervision. We…
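As an illustration of the curve fitting described in the abstract, the sketch below fits a three-parameter logistic to accuracy measured against log10 of the training step. The parameterization (saturation level `a_max`, rate `rate`, midpoint `midpoint`), the function names, and the grid-search fitting procedure are assumptions made for this example; the paper's exact functional form and estimator are not given in this summary. The data in the usage example are synthetic.

```python
import math


def logistic3(log_step, a_max, rate, midpoint):
    """Three-parameter logistic in log10(training step) (assumed form)."""
    return a_max / (1.0 + math.exp(-rate * (log_step - midpoint)))


def fit_logistic3(steps, accs, rate_grid=None, n_mid=200):
    """Least-squares fit by grid search over (rate, midpoint).

    For a fixed rate and midpoint the model is linear in a_max, so the
    best a_max has a closed form: sum(y * g) / sum(g * g).
    Returns (a_max, rate, midpoint).
    """
    xs = [math.log10(s) for s in steps]
    if rate_grid is None:
        rate_grid = [0.25 * i for i in range(1, 81)]  # 0.25 .. 20.0
    lo, hi = min(xs), max(xs)
    best = None
    for rate in rate_grid:
        for j in range(n_mid + 1):
            mid = lo + (hi - lo) * j / n_mid
            # Unit-height logistic evaluated at every log step.
            g = [logistic3(x, 1.0, rate, mid) for x in xs]
            a_max = sum(y * v for y, v in zip(accs, g)) / sum(v * v for v in g)
            sse = sum((y - a_max * v) ** 2 for y, v in zip(accs, g))
            if best is None or sse < best[0]:
                best = (sse, a_max, rate, mid)
    _, a_max, rate, mid = best
    return a_max, rate, mid


# Synthetic example: checkpoints spaced evenly in log10(step) from 1e2 to 1e6.
steps = [10 ** (k / 4) for k in range(8, 25)]
accs = [logistic3(math.log10(s), 0.9, 3.0, 4.0) for s in steps]
a_hat, r_hat, x0_hat = fit_logistic3(steps, accs)
print(f"a_max={a_hat:.3f} rate={r_hat:.2f} midpoint={x0_hat:.2f}")
```

Under this assumed form, a larger fitted `rate` corresponds to a sharper transition in log steps and a smaller `midpoint` to earlier generalization. A gradient-based fitter such as `scipy.optimize.curve_fit` could replace the grid search when SciPy is available.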
Peer Reviews
Decision·Submitted to ICLR 2026
Writing: - The paper is written clearly and is easy to follow. Conceptual: - The paper (re-)discovers a few interesting facts, namely reasoning unfaithfulness, CoT helping in generalization and a grokking transition. - The experiments cover a number of challenging (although synthetic) reasoning tasks and are performed correctly. - The interpretations of the training curves are interesting, seem plausible and give insight into the training behavior.
Contribution: - The paper basically does a number of experiments and reports training curves on them. - The take home message is, I would assume, well known lore in the deep learning community, including grokking, CoT faithfulness and CoT generalizability. No new insight has been produced. - The overall setup is highly artificial with only synthetic tasks trained on an LLM from scratch. I am not sure that these findings would translate to pre-trained LLMs that might already exhibit different beh
1. Clear, controllable testbed & measurements. The tasks vary algorithmic difficulty $k$ and data ratio $\phi$; the paper cleanly reports zero-/few-shot OOD accuracy and shows CoT often improves sample-efficiency (incl. a FLOPs view). 2. Architecture control. A matched-size Mamba fails to OOD-generalize while transformers succeed under CoT, highlighting inductive-bias differences.
1. Limited novel understandings. That CoT accelerates learning and improves expressiveness, and that intersection-like composition is hard, all broadly echo prior CoT understanding. The main new angle is the curve-fitting/Arrhenius narrative, which remains largely phenomenological. The Arrhenius analogy is not stress-tested (no explicit estimation of $\Delta$ or $T_{\text{eff}}$; validation reduces to trending $\hat{r}$ via Eq. (6)). 2. Kinetic self-consistency is shaky. The paper fits a logis
• The idea of fitting OOD accuracy over log training steps with a three-parameter logistic model to quantify learning speed and saturation is novel and interesting. This kinetic framing offers a new lens to analyze how CoT affects learning dynamics and the grokking transition.
• Intermediate accuracy metric definition: I understand the intermediate accuracy to be defined as a “trace overlap” measure matching the ground-truth trace to the predicted trace. However, many tasks can be solved using different but valid traces. This metric disregards all expedient but non-identical traces, potentially overstating the “knowledge gap” or “unfaithfulness phase.” A more nuanced metric accounting for alternative valid reasoning paths would strengthen the claims. • Limited evaluat
