Compositional Generalization from Learned Skills via CoT Training: A Theoretical and Structural Analysis for Reasoning

Xinhao Yao; Ruifeng Ren; Yun Liao; Lizhong Ding; Yong Liu

arXiv:2502.04667·cs.LG·February 13, 2026

Open Access·1 Repo·3 Reviews

TL;DR

This paper provides a theoretical and structural analysis of how Chain-of-Thought training enhances reasoning in large language models by promoting compositional skill integration, leading to better generalization and internal reasoning structures.

Contribution

It formalizes the mechanisms behind CoT training's effectiveness, highlighting its role in internalizing reasoning as a compositional process and improving out-of-distribution generalization.

Findings

01

CoT training improves OOD generalization by compositional skill combination.

02

Models internalize reasoning into a multi-stage circuit structure.

03

CoT-trained models resolve intermediate results at shallower layers.

Abstract

Chain-of-Thought (CoT) training has markedly advanced the reasoning capabilities of large language models (LLMs), yet the mechanisms by which CoT training enhances generalization remain inadequately understood. In this work, we demonstrate that compositional generalization is fundamental: models systematically combine simpler learned skills during CoT training to address novel and more complex problems. Through a theoretical and structural analysis, we formalize this process: 1) Theoretically, the information-theoretic generalization bounds through distributional divergence can be decomposed into in-distribution (ID) and out-of-distribution (OOD) components. Specifically, the non-CoT models fail on OOD tasks due to unseen compositional patterns, whereas CoT-trained models achieve strong generalization by composing previously learned skills. In addition, controlled experiments and…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 3

Strengths

1. The authors provide theoretical results that explain why CoT training can lead to better generalization. To my knowledge these results are novel.
2. The paper is clearly written.
3. The experiments also consider noisy data, to more closely reflect real training data.

Weaknesses

1. The experimental set-up is not particularly novel. I think it is largely equivalent to other experimental set-ups with synthetic datasets, such as addition. In addition, for example, models are trained on one-digit sums plus some two-digit sums, and it has been observed that the models generalize and perform better when CoT is also provided (see [1], for example).

Reviewer 02·Rating 2·Confidence 4

Strengths

The paper's empirical finding that CoT training can help OOD generalization is interesting. The paper also combines theory with empirical findings, which is a good attempt.

Weaknesses

I think the biggest issue is that you cannot directly assume "under sufficient training, the ID generalization error approaches zero" without any justification. This depends on which architecture and tasks you are considering. For example, on the star-graph problem a model trained by next-token prediction cannot perform well even on the training set, no matter how long it is trained [1]. Thus, in my opinion, it is better to take the view from optimization, which might be hard to analyze. Or you ca

Reviewer 03·Rating 4·Confidence 5

Strengths

1. This paper supports its claims with both theoretical analysis and experimental results, and it also includes a mechanism analysis, which makes the paper comprehensive.
2. The use of a controlled synthetic task allows for a crisp demonstration of the core principle: compositional generalization.
3. The presentation of this paper is clear and easy to understand.

Weaknesses

1. In my view, the results achieved by the CoT training paradigm proposed in the paper still essentially rely on the model performing single-step reasoning. The model merely learns to recognize the first two tokens in the pattern "e1, r1, r2" and complete the first step of reasoning, and then, given the input "e1, r1, r2, e2", it learns to recognize the last two tokens and complete the second step. What the model has actually learned are two distinct single-step reasoning tasks under different i
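The two-step pattern this reviewer describes can be made concrete with a small sketch. It is an illustration only, not the authors' data pipeline: the helper names (`build_atomic_facts`, `two_hop_example`, `next_token_steps`), the random fact table, and the token spelling are all assumptions. Only the prompt layout "e1, r1, r2" and the intermediate entity "e2" come from the review text. The sketch shows that a CoT target unrolls into two next-token prediction steps, while a non-CoT target has a single step that must produce the final answer directly.

```python
import random


def build_atomic_facts(num_entities, num_relations, seed=0):
    """Hypothetical fact table: maps (head entity, relation) -> tail entity."""
    rng = random.Random(seed)
    return {
        (h, r): rng.randrange(num_entities)
        for h in range(num_entities)
        for r in range(num_relations)
    }


def two_hop_example(facts, e1, r1, r2, with_cot):
    """Build one two-hop sample with prompt tokens "e1, r1, r2".

    With CoT the target contains the bridge entity e2 and then the answer e3;
    without CoT the target is the answer e3 alone.
    """
    e2 = facts[(e1, r1)]  # first hop: resolve the bridge entity
    e3 = facts[(e2, r2)]  # second hop: resolve the final answer
    prompt = [f"e{e1}", f"r{r1}", f"r{r2}"]
    target = [f"e{e2}", f"e{e3}"] if with_cot else [f"e{e3}"]
    return prompt, target


def next_token_steps(prompt, target):
    """Unroll a sample into (context, next token) pairs, as in teacher forcing."""
    steps = []
    context = list(prompt)
    for token in target:
        steps.append((tuple(context), token))
        context.append(token)
    return steps


if __name__ == "__main__":
    facts = build_atomic_facts(num_entities=5, num_relations=3)
    for with_cot in (True, False):
        prompt, target = two_hop_example(facts, e1=0, r1=1, r2=2, with_cot=with_cot)
        print("CoT" if with_cot else "non-CoT", next_token_steps(prompt, target))
```

Under these assumptions, the CoT sample yields one step with context "e1, r1, r2" and a second step with context "e1, r1, r2, e2", which is the decomposition into two single-step predictions that the reviewer points to.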

Code & Models

Repositories

chen123ctrls/t-cotmechanism
jax·Official

Taxonomy

Topics·Decision-Making and Behavioral Economics