TL;DR
This paper shows that Chain-of-Thought (CoT) training improves language model generalization to out-of-distribution (OOD) compound tasks. Finer-grained CoT data generalizes better, and CoT needs less training data than question-answer (QA) training. Both claims are supported by theoretical and empirical analysis.
Contribution
The paper shows that CoT data granularity and sample efficiency matter for OOD generalization, and it provides theoretical insight into why CoT training outperforms QA training.
Findings
Finer-grained CoT data enhances generalization performance.
CoT achieves similar accuracy with significantly less data compared to QA.
Transformer positional embeddings can further improve CoT generalization.
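To make the terms "QA data" and "CoT granularity" concrete, the following is a minimal sketch of how the two training formats could differ on a toy compound task. Everything in it is an illustrative assumption rather than the paper's setup: the atomic steps, the `stride` parameter used to model granularity, and the text format of the examples are all hypothetical.

```python
from typing import Callable, List

# Hypothetical atomic steps that compose into one compound task.
# These are placeholders, not the tasks used in the paper.
ATOMIC_STEPS: List[Callable[[int], int]] = [
    lambda x: x + 3,
    lambda x: x * 2,
    lambda x: x - 1,
    lambda x: x % 7,
]


def run_chain(x: int) -> List[int]:
    """Apply every atomic step in order and record each intermediate value."""
    values = []
    for step in ATOMIC_STEPS:
        x = step(x)
        values.append(x)
    return values


def qa_example(x: int) -> str:
    """QA format: the question and the final answer only."""
    return f"Q: {x} A: {run_chain(x)[-1]}"


def cot_example(x: int, stride: int = 1) -> str:
    """CoT format: every `stride`-th intermediate value, then the final answer.

    stride=1 keeps all intermediate steps (finest granularity);
    larger strides skip steps and give a coarser chain.
    """
    values = run_chain(x)
    kept = values[stride - 1::stride]
    # The final answer is always included, even if the stride skips it.
    if len(values) % stride != 0:
        kept.append(values[-1])
    steps = " -> ".join(str(v) for v in kept[:-1])
    return f"Q: {x} Steps: {steps} A: {kept[-1]}"


if __name__ == "__main__":
    print(qa_example(5))
    print(cot_example(5, stride=1))
    print(cot_example(5, stride=2))
```

In this toy setting, a QA example exposes only the input and output of the whole chain, while a fine-grained CoT example also supervises each intermediate value. Varying `stride` is one simple way to think about coarser or finer chains.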
Abstract
Generalization to novel compound tasks under distribution shift is important for deploying transformer-based language models (LMs). This work investigates Chain-of-Thought (CoT) reasoning as a means to enhance out-of-distribution (OOD) generalization. Through controlled experiments across several compound tasks, we reveal three key insights: (1) while QA-trained models achieve near-perfect in-distribution accuracy, their OOD performance degrades catastrophically, even with 10000k+ training examples; (2) the granularity of CoT data strongly correlates with generalization performance, with finer-grained CoT data leading to better generalization; (3) CoT exhibits remarkable sample efficiency, matching QA performance with much less (even 80%) data. Theoretically, we demonstrate that compound tasks inherently permit shortcuts in QA data that misalign with true reasoning principles, while CoT forces internalization of…
