From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency

Kaiyue Wen; Huaqing Zhang; Hongzhou Lin; Jingzhao Zhang

arXiv:2410.05459·cs.LG·March 6, 2025

TL;DR

This paper demonstrates that chain-of-thought (CoT) improves transformer sample efficiency by inducing sparse, interpretable attention patterns. This simplifies learning and improves reasoning performance in ways that gains in expressiveness alone do not explain.

Contribution

The paper shows that CoT's benefits stem from the sparsity it induces in attention, not only from increased expressiveness. The claim is supported by theoretical analysis and experiments.

Findings

1. CoT improves sample efficiency in parity learning tasks.
2. CoT induces sparse, interpretable attention patterns.
3. Sparsity in attention layers is key to CoT's effectiveness.
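
One common way to quantify how sparse an attention pattern is, offered here only as an illustrative sketch, is the Shannon entropy of a single attention row: a row spread evenly over all positions has high entropy, while a row concentrated on one position has entropy near zero. The function names and the example scores below are assumptions made for illustration and are not taken from the paper or its repository.

```python
import math


def softmax(scores):
    """Turn raw attention scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention row; lower means sparser."""
    return -sum(w * math.log(w) for w in weights if w > 0)


# Hypothetical rows: equal scores give uniform attention,
# one dominant score gives attention concentrated on a single token.
dense_row = softmax([0.0, 0.0, 0.0, 0.0])
sparse_row = softmax([10.0, 0.0, 0.0, 0.0])

print("dense entropy:", attention_entropy(dense_row))
print("sparse entropy:", attention_entropy(sparse_row))
```

The uniform row over four positions reaches the maximum entropy log(4), and the concentrated row falls far below it, which is the sense of "sparse attention" used in the findings above.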

Abstract

Chain-of-thought (CoT) significantly enhances the reasoning performance of large language models (LLMs). While current theoretical studies often attribute this improvement to increased expressiveness and computational capacity, we argue that expressiveness is not the primary limitation in the LLM regime, as current large models still fail on simple tasks. Using a parity-learning setup, we demonstrate that CoT can substantially improve sample efficiency even when the representation power is sufficient. Specifically, with CoT, a transformer can learn the function with polynomially many samples, whereas without CoT, the required sample size is exponential. Additionally, we show that CoT simplifies the learning process by introducing sparse sequential dependencies among input tokens, and leads to sparse and interpretable attention. We validate our theoretical analysis with both synthetic and…
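
To make the parity-learning setup concrete, the sketch below builds one training example with and without CoT. It assumes the usual formulation of the task: the label is the XOR of the input bits at a hidden set of positions, and the CoT target lists the running XOR after each secret position, so each step depends only on the previous step and one input bit. The function name, the `secret` index list, and the exact target layout are hypothetical and may differ from the paper's data format.

```python
import random


def parity_targets(bits, secret, with_cot):
    """Return the target sequence for one parity example.

    bits     : list of 0/1 input bits
    secret   : hidden positions whose XOR defines the label (assumed format)
    with_cot : if True, return every running XOR; otherwise only the final one
    """
    partial = 0
    steps = []
    for idx in secret:
        partial ^= bits[idx]   # each step uses one input bit and the previous step
        steps.append(partial)
    return steps if with_cot else [partial]


def sample_bits(n, rng):
    """Draw n uniformly random input bits."""
    return [rng.randint(0, 1) for _ in range(n)]


rng = random.Random(0)
secret = [0, 2, 5]             # hypothetical secret positions
bits = sample_bits(8, rng)
print("bits:       ", bits)
print("without CoT:", parity_targets(bits, secret, with_cot=False))
print("with CoT:   ", parity_targets(bits, secret, with_cot=True))
```

Without CoT the single output bit depends on all secret positions at once. With CoT each target token depends on just two earlier tokens, which is the sparse sequential dependence the abstract refers to.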

Code & Models

Repositories

zhqwqwq/Learning-Parity-with-CoT (PyTorch, official)

Taxonomy

Topics: Neural dynamics and brain function

Methods: Softmax · Attention Is All You Need