Learning Linear Attention in Polynomial Time
Morris Yau, Ekin Akyürek, Jiayuan Mao, Joshua B. Tenenbaum, Stefanie Jegelka, Jacob Andreas

TL;DR
This paper proves that single-layer Transformers with linear attention are learnable in polynomial time in the strong, agnostic PAC sense, narrowing the gap between what these models can express in theory and what can be learned from data in practice.
Contribution
The paper gives the first polynomial-time learnability results for single-layer linear attention Transformers. It does so by showing that linear attention acts as a linear predictor in a reproducing kernel Hilbert space (RKHS), and it supports the theory with experiments.
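The link to linear predictors can be illustrated with a small numerical check. The sketch below is an illustration under stated assumptions, not the paper's exact construction: it assumes a single head that reads out at the last token, with the key and query matrices merged into one matrix `A`, so the output is `V (Z Z^T) A z_last`. The function names `linear_attention`, `features`, and `weights` are hypothetical. Under these assumptions the output is linear in a degree-3 polynomial feature vector of the input.

```python
import numpy as np

def linear_attention(Z, V, A):
    """Single-head linear attention output at the last token (assumed form).

    Z: (d, n) matrix whose columns are tokens.
    V: (d, d) value matrix.
    A: (d, d) merged key-query matrix.
    """
    z_last = Z[:, -1]
    return V @ (Z @ Z.T) @ A @ z_last

def features(Z):
    """Feature map phi[b, c, e] = (Z Z^T)[b, c] * z_last[e], flattened to length d**3."""
    z_last = Z[:, -1]
    return np.einsum("bc,e->bce", Z @ Z.T, z_last).reshape(-1)

def weights(V, A):
    """Linear predictor W[a, (b, c, e)] = V[a, b] * A[c, e], shape (d, d**3)."""
    d = V.shape[0]
    return np.einsum("ab,ce->abce", V, A).reshape(d, -1)

rng = np.random.default_rng(0)
d, n = 3, 5
Z = rng.normal(size=(d, n))
V = rng.normal(size=(d, d))
A = rng.normal(size=(d, d))

direct = linear_attention(Z, V, A)          # attention computed directly
as_linear = weights(V, A) @ features(Z)     # same output as a linear map on features
print(np.allclose(direct, as_linear))
```

The parameters `V` and `A` enter only through the weight matrix, while the input enters only through the fixed feature map, which is what makes the model a linear predictor in the expanded feature space.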
Findings
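Single-layer linear attention models are learnable in polynomial time.
The theory is validated empirically on tasks such as learning automata and key-value associations.
Computations that linear attention can express, and that are therefore polynomial-time learnable, include associative memories and finite automata.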
Abstract
Previous research has explored the computational expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the learnability of these simulators from observational data has remained an open question. Our study addresses this gap by providing the first polynomial-time learnability results (specifically strong, agnostic PAC learning) for single-layer Transformers with linear attention. We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS. As a consequence, the problem of learning any linear transformer may be converted into the problem of learning an ordinary linear predictor in an expanded feature space, and any such predictor may be converted back into a multiheaded linear transformer. Moving to generalization, we show how to efficiently identify training datasets for which every empirical risk minimizer is…
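The round trip described in the abstract (fit a linear predictor in the expanded feature space, then convert it back into a multiheaded linear transformer) can be sketched as follows. This is a minimal illustration, not the authors' algorithm: it assumes noiseless data from a two-head teacher, uses ordinary least squares rather than an agnostic learner, and recovers heads with a singular value decomposition, which is one possible conversion. All names (`features`, `multihead_linear_attention`, `predictor_to_heads`) are hypothetical.

```python
import numpy as np

def features(Z):
    # phi[b, c, e] = (Z Z^T)[b, c] * z_last[e], flattened to length d**3
    return np.einsum("bc,e->bce", Z @ Z.T, Z[:, -1]).reshape(-1)

def multihead_linear_attention(Z, heads):
    # Sum over heads of V_h (Z Z^T) A_h z_last (assumed readout at the last token)
    G, z_last = Z @ Z.T, Z[:, -1]
    return sum(V @ G @ A @ z_last for V, A in heads)

def predictor_to_heads(W, d, tol=1e-10):
    """Split a linear predictor W of shape (d, d**3) into (V_h, A_h) pairs.

    W is viewed as a matrix M with rows indexed by (a, b) and columns by (c, e).
    Each singular triple of M yields one head, so at most d*d heads are produced.
    """
    M = W.reshape(d * d, d * d)
    U, s, Vt = np.linalg.svd(M)
    heads = []
    for h in range(len(s)):
        if s[h] > tol:  # skip numerically zero directions
            scale = np.sqrt(s[h])
            heads.append((scale * U[:, h].reshape(d, d), scale * Vt[h].reshape(d, d)))
    return heads

rng = np.random.default_rng(0)
d, n, num_samples = 3, 6, 200

# Hypothetical teacher: a two-head linear attention model with random weights.
teacher = [(rng.normal(size=(d, d)), rng.normal(size=(d, d))) for _ in range(2)]

# Training data: random token sequences and the teacher's outputs.
train = [rng.normal(size=(d, n)) for _ in range(num_samples)]
X = np.stack([features(Z) for Z in train])                            # (num_samples, d**3)
Y = np.stack([multihead_linear_attention(Z, teacher) for Z in train])  # (num_samples, d)

# Ordinary least squares in the expanded feature space.
W_hat = np.linalg.lstsq(X, Y, rcond=None)[0].T                         # (d, d**3)

# Convert the learned linear predictor back into attention heads.
student = predictor_to_heads(W_hat, d)

# Compare teacher and recovered model on a fresh sequence.
Z_test = rng.normal(size=(d, n))
teacher_out = multihead_linear_attention(Z_test, teacher)
student_out = multihead_linear_attention(Z_test, student)
print(len(student), np.allclose(teacher_out, student_out, atol=1e-6))
```

The recovered model may use more heads than the teacher, since the least-squares solution is not unique in this feature space; only the input-output behavior is expected to match.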
Taxonomy
Topics: Neural Networks and Applications · Machine Learning and Algorithms · Face and Expression Recognition
Methods: Attention Is All You Need · Dense Connections · Residual Connection · Dropout · Layer Normalization · Adam · Byte Pair Encoding · Absolute Position Encodings · Softmax · Linear Layer
