Beyond In-Distribution Success: Scaling Curves of CoT Granularity for Language Model Generalization

Ru Wang; Wei Huang; Selena Song; Haoyu Zhang; Qian Niu; Yusuke Iwasawa; Yutaka Matsuo; Jiaxian Guo

arXiv:2502.18273·cs.CL·March 31, 2026

TL;DR

This paper demonstrates that Chain-of-Thought (CoT) reasoning improves language model generalization to out-of-distribution compound tasks: finer-grained CoT data generalizes better, and CoT needs less training data than QA training. The claims are supported by theoretical and empirical analysis.

Contribution

It shows that CoT data granularity and sample efficiency matter for out-of-distribution (OOD) generalization, and it provides theoretical insight into why CoT training outperforms QA training.

Findings

01. Finer-grained CoT data enhances generalization performance.

02. CoT matches QA accuracy with significantly less data.

03. Transformer positional embeddings can further improve CoT generalization.

Abstract

Generalization to novel compound tasks under distribution shift is important for deploying transformer-based language models (LMs). This work investigates Chain-of-Thought (CoT) reasoning as a means to enhance OOD generalization. Through controlled experiments across several compound tasks, we reveal three key insights: (1) while QA-trained models achieve near-perfect in-distribution accuracy, their OOD performance degrades catastrophically, even with 10000k+ training examples; (2) the granularity of CoT data strongly correlates with generalization performance; finer-grained CoT data leads to better generalization; (3) CoT exhibits remarkable sample efficiency, matching QA performance with much less (even 80%) data. Theoretically, we demonstrate that compound tasks inherently permit shortcuts in QA data that misalign with true reasoning principles, while CoT forces internalization of…
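To make the contrast between QA data and CoT data of varying granularity concrete, here is a minimal sketch. It is not taken from the paper or its repository: the toy compound task (a fixed chain of arithmetic functions), the names `STEPS`, `solve`, and `format_example`, and the `keep_every` knob used to thin out intermediate steps are all hypothetical, and the paper's own tasks and definition of granularity may differ.

```python
from typing import Callable, List, Tuple

# Hypothetical compound task: apply a fixed chain of simple functions.
STEPS: List[Tuple[str, Callable[[int], int]]] = [
    ("add3", lambda x: x + 3),
    ("double", lambda x: x * 2),
    ("sub1", lambda x: x - 1),
    ("square", lambda x: x * x),
]


def solve(x: int) -> List[int]:
    """Return the value after each step; the last entry is the final answer."""
    values = []
    for _, fn in STEPS:
        x = fn(x)
        values.append(x)
    return values


def format_example(x: int, keep_every: int) -> str:
    """Build one training string for input x.

    keep_every=0 emits no intermediate steps (plain QA).
    keep_every=1 emits every intermediate step (finest-grained CoT).
    Larger values emit only every keep_every-th step (coarser CoT).
    """
    values = solve(x)
    names = ", ".join(name for name, _ in STEPS)
    question = f"Q: apply {names} to {x}."
    shown = []
    if keep_every > 0:
        for i, ((name, _), value) in enumerate(zip(STEPS, values), start=1):
            # The final step is excluded here; its value is the answer below.
            if i < len(STEPS) and i % keep_every == 0:
                shown.append(f"{name} -> {value}")
    reasoning = (" " + "; ".join(shown) + ".") if shown else ""
    return f"{question}{reasoning} A: {values[-1]}"


# Print the same example as QA, coarse CoT, and fine CoT.
for k in (0, 2, 1):
    print(format_example(2, k))
```

In this sketch the QA string exposes only the input and the final answer, while the CoT strings expose progressively more of the intermediate computation as `keep_every` approaches 1.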

Code & Models

Repositories

physicsru/Scaling-Curves-of-CoT-Granularity-for-Language-Model-Generalization (GitHub)
