Provably Learning Attention with Queries

Satwik Bhattamishra; Kulin Shah; Michael Hahn; Varun Kanade

arXiv:2601.16873·cs.LG·May 5, 2026

TL;DR

This paper studies how to learn Transformer models from black-box query access to their outputs. It gives algorithms for single-head attention, analyzes robustness to noise, and discusses the challenges of learning multi-head attention.

Contribution

The paper introduces query-efficient algorithms for learning single-head attention, extends them to one-layer Transformers with FFNs, and analyzes the limitations of learning multi-head attention.

Findings

01. Single-head attention with width d can be learned with O(d^2) queries.

02. If ReLU FFNs can be learned, the single-head algorithm can be adapted to learn one-layer Transformers with single-head attention.

03. Multi-head attention is not identifiable from queries without extra assumptions.

Abstract

We study the problem of learning Transformer-based sequence models with black-box access to their outputs. In this setting, a learner may adaptively query the oracle with any sequence of vectors and observe the output of the target function. We begin with studying the learnability of the simplest formulation, that is, learning a single-head attention-based regressor with queries. We show that for a model with width $d$, there is an elementary algorithm to learn the parameters of single-head attention with $O(d^{2})$ queries. Further, we show that if there exists an algorithm to learn ReLU feedforward networks (FFNs), then the single-head algorithm can be easily adapted to learn one-layer Transformers with single-head attention. Next, we show that, in the common regime where the head dimension $r \ll d$, single-head attention-based models can be learned with $O(rd)$ queries via compressed…
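To make the query setting concrete, here is a minimal sketch of how a learner with black-box access could recover the parameters of a toy single-head attention regressor using d + d^2 queries. Everything in it is an assumption made for illustration and is not taken from the paper: the model form f(X) = sum_i softmax_i(x_i^T W x_n) v^T x_i with the last token acting as the query, the names `make_oracle` and `learn_single_head`, and the recovery procedure itself, which may differ from the authors' algorithm. It also assumes noiseless outputs and v != 0.

```python
import numpy as np

def make_oracle(W, v):
    """Black-box f(X) for a toy single-head attention regressor (assumed form)."""
    def f(X):
        X = np.atleast_2d(X)            # rows are tokens x_1..x_n
        scores = X @ W @ X[-1]          # s_i = x_i^T W x_n, last token is the query
        p = np.exp(scores - scores.max())
        p /= p.sum()                    # softmax over positions
        return float(p @ (X @ v))       # sum_i p_i * v^T x_i
    return f

def learn_single_head(f, d, rng):
    """Recover (W, v) from d + d^2 queries to f (illustrative, noiseless case)."""
    I = np.eye(d)
    # Step 1: a length-1 sequence gives f((x)) = v^T x, so d queries recover v.
    v_hat = np.array([f(I[i]) for i in range(d)])
    # Step 2: for X = (b + u, b), f = v^T b + p * v^T u with p = sigmoid(u^T W b),
    # so logit(p) is a linear measurement u^T (W b) of the j-th column of W.
    U = rng.standard_normal((d, d))     # random directions; v^T u != 0 almost surely
    W_hat = np.zeros((d, d))
    for j in range(d):
        b = I[j]
        m = np.empty(d)
        for k, u in enumerate(U):
            y = f(np.stack([b + u, b]))
            p = (y - v_hat @ b) / (v_hat @ u)
            m[k] = np.log(p / (1.0 - p))
        W_hat[:, j] = np.linalg.solve(U, m)   # solve U @ (W b) = m
    return W_hat, v_hat

# Hypothetical target parameters, hidden from the learner behind the oracle.
rng = np.random.default_rng(0)
d = 4
W = rng.standard_normal((d, d)) / np.sqrt(d)
v = rng.standard_normal(d)
W_hat, v_hat = learn_single_head(make_oracle(W, v), d, rng)
print(np.abs(W_hat - W).max(), np.abs(v_hat - v).max())
```

The sketch relies on two observations that hold for this assumed model: a one-token sequence bypasses the softmax entirely, and a two-token sequence reduces the softmax to a sigmoid whose logit is linear in W. With noisy outputs or saturated attention weights the logit inversion becomes unstable, so this should be read as an intuition aid rather than a robust procedure.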
