Provably Learning Attention with Queries
Satwik Bhattamishra, Kulin Shah, Michael Hahn, Varun Kanade

TL;DR
This paper investigates the learnability of Transformer models with black-box access, providing algorithms for single-head attention, analyzing robustness to noise, and discussing the challenges of learning multi-head attention.
Contribution
It introduces query-efficient algorithms for learning single-head attention, extends them to one-layer Transformers with FFNs, and analyzes the limits of learning multi-head attention from queries.
Findings
Single-head attention can be learned with O(d^2) queries.
If ReLU FFNs can be learned, the single-head algorithm adapts to one-layer Transformers with single-head attention.
Multi-head attention is not identifiable from queries without extra assumptions.
Abstract
We study the problem of learning Transformer-based sequence models with black-box access to their outputs. In this setting, a learner may adaptively query the oracle with any sequence of vectors and observe the output of the target function. We begin with studying the learnability of the simplest formulation, that is, learning a single-head attention-based regressor with queries. We show that for a model with width d, there is an elementary algorithm to learn the parameters of single-head attention with O(d^2) queries. Further, we show that if there exists an algorithm to learn ReLU feedforward networks (FFNs), then the single-head algorithm can be easily adapted to learn one-layer Transformers with single-head attention. Next, we show that, in the common regime where the head dimension r ≪ d, single-head attention-based models can be learned with O(rd) queries via compressed…
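To make the query setting concrete, here is a minimal sketch of how O(d^2) adaptive queries can recover a single-head attention regressor. It is not the paper's algorithm. It assumes a hypothetical model form f(X) = sum_i softmax_i(x_i^T W x_n) * v^T x_i, where the last token x_n acts as the query, W is a d x d attention matrix, and v is a nonzero value vector. The names `make_oracle`, `learn_single_head`, and `score_gap` are illustrative, and the oracle is assumed to be exact (noise-free).

```python
import numpy as np

def make_oracle(W, v):
    """Black-box f(X) for the ASSUMED single-head attention regressor."""
    def f(X):
        X = np.atleast_2d(np.asarray(X, dtype=float))  # rows are tokens
        scores = X @ W @ X[-1]               # x_i^T W x_n, last token is the query
        p = np.exp(scores - scores.max())
        p /= p.sum()                         # softmax over positions
        return float(p @ (X @ v))            # attention-weighted sum of v^T x_i
    return f

def learn_single_head(f, d):
    """Recover (v, W) from value queries; assumes v != 0 and a noise-free oracle."""
    n_queries = 0

    def query(X):
        nonlocal n_queries
        n_queries += 1
        return f(X)

    I = np.eye(d)
    # Step 1: a length-1 sequence gets attention weight 1, so f([e_i]) = v_i.
    v_hat = np.array([query(I[i]) for i in range(d)])
    u = v_hat / (v_hat @ v_hat)              # chosen so that v^T u = 1

    def score_gap(delta, j):
        # Query the pair (e_j + delta, e_j). The output is a convex combination
        # p*a + (1-p)*b, and logit(p) = delta^T W e_j.
        x2 = I[j]
        x1 = x2 + delta
        out = query(np.stack([x1, x2]))
        a, b = v_hat @ x1, v_hat @ x2
        p = (out - b) / (a - b)              # attention weight on x1
        return np.log(p / (1.0 - p))

    # Step 2: baseline values u^T W e_j, one query per column j.
    base = np.array([score_gap(u, j) for j in range(d)])

    # Step 3: one query per entry; the sign c keeps a - b = 1 + |v_i| >= 1.
    W_hat = np.zeros((d, d))
    for i in range(d):
        c = 1.0 if v_hat[i] >= 0 else -1.0
        for j in range(d):
            W_hat[i, j] = (score_gap(u + c * I[i], j) - base[j]) / c
    return v_hat, W_hat, n_queries
```

Under these assumptions the sketch issues d + d + d^2 queries in total: d for v, d for the baselines, and one per entry of W. With a noisy oracle the logit inversion would amplify errors, so this only illustrates the exact-query case.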
