Transformers Provably Learn Sparse XOR with Polylogarithmic Parameters
Yaomengxi Han, Debarghya Ghoshdastidar

TL;DR
This paper proves that a single-layer, two-head Transformer can learn sparse XOR functions with polylogarithmically many trainable parameters, fewer than existing FFNN analyses need, and shows that it discovers the relevant features rapidly and generalizes from finite data.
Contribution
The paper provides the first theoretical analysis showing that Transformers can learn sparse XOR with O(polylog(d)) trainable parameters, breaking the FFNN parameter bottleneck.
Findings
Transformers can learn sparse XOR with O(polylog(d)) parameters.
Exact softmax attention is crucial for rapid feature discovery.
Transformers generalize well with finite data, as shown by sample complexity bounds.
Abstract
Learning sparse parity functions has become a theoretical testbed for studying feature learning in neural networks. However, existing analyses primarily focus on Feed-Forward Neural Networks (FFNNs). Meanwhile, theoretical understanding of Transformers in this setting remains limited, despite their empirical success and structural suitability for discovering sparse support over long sequences. To address this gap, we analyze how a single-layer, two-head Transformer learns the sparse XOR problem. Considering samples whose label is a sparse XOR of an unknown subset of the input coordinates, we prove that, with only polylogarithmically many trainable parameters, Transformers can successfully discover the relevant features and drive the loss for every input to nearly 0 with one gradient step. This result…
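To make the problem setting concrete, here is a minimal sketch of sparse XOR data generation. The exact sample and label definitions are missing from the abstract as reproduced here, so everything specific is an assumption for illustration: inputs drawn uniformly from {-1, +1}^d, a support of two coordinates, and a label equal to the product of the support coordinates (the usual ±1 encoding of XOR). The function name `sample_sparse_xor` and the chosen indices are hypothetical and not taken from the paper.

```python
import numpy as np

def sample_sparse_xor(n, d, support, rng):
    """Draw n inputs from {-1, +1}^d and label each by the product
    of the coordinates in `support` (assumed +/-1 encoding of XOR)."""
    x = rng.choice([-1, 1], size=(n, d))
    y = np.prod(x[:, support], axis=1)
    return x, y

rng = np.random.default_rng(0)
d = 64
support = np.array([3, 17])  # hypothetical hidden support, unknown to the learner
x, y = sample_sparse_xor(1000, d, support, rng)

# No single coordinate is informative on its own: each empirical
# correlation with the label is close to zero.
single_corr = np.abs((x * y[:, None]).mean(axis=0))

# The product of the two support coordinates matches the label exactly.
pair_corr = np.mean(x[:, 3] * x[:, 17] * y)
```

The gap between `single_corr` and `pair_corr` is what makes the task a feature-learning testbed: the learner has to find the relevant coordinates jointly, since each one looks like noise in isolation.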
