Learning Compositional Functions with Transformers from Easy-to-Hard Data
Zixuan Wang, Eshaan Nichani, Alberto Bietti, Alex Damian, Daniel Hsu, Jason D. Lee, Denny Wu

TL;DR
This paper investigates the learnability of complex compositional functions with transformers, demonstrating both theoretical limitations and practical methods for efficient learning using curriculum strategies.
Contribution
The paper provides the first SQ lower bounds for the k-fold composition task and proposes curriculum-based gradient descent methods that learn such compositional functions efficiently.
Findings
An SQ lower bound shows that SQ learners restricted to polynomially many queries incur exponential sample complexity.
Gradient descent with curriculum learning efficiently trains transformers on compositional tasks.
Both easy and hard examples are necessary for effective learning of complex functions.
Abstract
Transformer-based language models have demonstrated impressive capabilities across a range of complex reasoning tasks. Prior theoretical work exploring the expressive power of transformers has shown that they can efficiently perform multi-step reasoning tasks involving parallelizable computations. However, the learnability of such constructions, particularly the conditions on the data distribution that enable efficient learning via gradient-based optimization, remains an open question. Towards answering this question, in this work we study the learnability of the k-fold composition task, which requires computing an interleaved composition of k input permutations and k hidden permutations, and can be expressed by a transformer with O(log k) layers. On the negative front, we prove a Statistical Query (SQ) lower bound showing that any SQ learner that makes only polynomially-many…
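As a rough illustration of the task described in the abstract, the following Python sketch composes k hidden permutations with k input permutations in an interleaved order. The ordering convention (hidden map first, then input map, repeated k times), the helper names, and the pairwise merge are all assumptions made for illustration and are not taken from the paper. The pairwise merge is only meant to show why such a composition is parallelizable: composition is associative, so 2k maps can be combined in roughly log2(2k) rounds.

```python
import random


def random_permutation(n, rng):
    """Sample a uniformly random permutation of {0, ..., n-1} as a list."""
    perm = list(range(n))
    rng.shuffle(perm)
    return perm


def compose(outer, inner):
    """Return the permutation x -> outer[inner[x]]."""
    return [outer[i] for i in inner]


def interleave(hidden, inputs):
    """Order the maps as pi_1, sigma_1, ..., pi_k, sigma_k (assumed convention)."""
    maps = []
    for pi, sigma in zip(hidden, inputs):
        maps.extend([pi, sigma])
    return maps


def kfold_sequential(hidden, inputs, x):
    """Apply the interleaved maps one after another: 2k sequential steps."""
    for perm in interleave(hidden, inputs):
        x = perm[x]
    return x


def kfold_tree(hidden, inputs, x):
    """Merge adjacent maps pairwise; returns (result, number of merge rounds)."""
    maps = interleave(hidden, inputs)
    rounds = 0
    while len(maps) > 1:
        # maps[i] is applied first, so it is the inner map of each pair.
        merged = [compose(maps[i + 1], maps[i]) for i in range(0, len(maps) - 1, 2)]
        if len(maps) % 2 == 1:
            merged.append(maps[-1])  # an unpaired map carries over unchanged
        maps = merged
        rounds += 1
    return maps[0][x], rounds


if __name__ == "__main__":
    rng = random.Random(0)
    n, k = 5, 4
    hidden = [random_permutation(n, rng) for _ in range(k)]  # fixed, unknown to the learner
    inputs = [random_permutation(n, rng) for _ in range(k)]  # given as part of each example
    for x in range(n):
        value, rounds = kfold_tree(hidden, inputs, x)
        print(x, kfold_sequential(hidden, inputs, x), value, rounds)
```

Both functions return the same value for every starting point; the tree version needs only a logarithmic number of merge rounds, which loosely mirrors the logarithmic depth mentioned in the abstract.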
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Graph Neural Networks
