Exact Sequence Interpolation with Transformers
Albert Alcalde, Giovanni Fantuzzi, Enrique Zuazua

TL;DR
This paper proves that transformers can exactly interpolate finite datasets of sequences in R^d, gives an explicit construction with complexity estimates, and extends the results from hardmax to softmax attention.
Contribution
The authors present a constructive method showing that transformers can exactly interpolate sequence datasets, with complexity bounds independent of the input sequence length and with low-rank parameter matrices in the self-attention layers.
Findings
Transformers can exactly interpolate datasets of sequences, with complexity estimates independent of the input sequence length.
The explicit construction uses low-rank parameter matrices in self-attention, a common feature of practical transformer implementations.
Provides convergence guarantees for training transformers to global minima.
Abstract
We prove that transformers can exactly interpolate datasets of finite input sequences in R^d with corresponding output sequences of smaller or equal length. Specifically, given sequences of arbitrary but finite lengths in R^d and output sequences of prescribed lengths, we construct a transformer that exactly interpolates the dataset, with explicit estimates of its number of blocks and parameters. Our construction provides complexity estimates that are independent of the input sequence length, by alternating feed-forward and self-attention layers and by capitalizing on the clustering effect inherent to the latter. Our novel constructive method also uses low-rank parameter matrices in the self-attention mechanism, a common feature of practical transformer implementations. These results are…
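To make the hardmax and softmax settings mentioned in the TL;DR concrete, the following is a minimal NumPy sketch, not the authors' construction. It assumes a single attention head with a hypothetical rank-one score matrix `A = u v^T`, an inverse-temperature parameter `beta`, and a toy sequence `X`, all chosen only for illustration. It relies on the standard fact that softmax weights approach weights spread uniformly over each row's maximizers as `beta` grows.

```python
import numpy as np

def softmax_attention_weights(X, A, beta):
    """Row-stochastic weights softmax(beta * X A X^T), computed stably."""
    scores = beta * (X @ A @ X.T)
    scores = scores - scores.max(axis=1, keepdims=True)  # avoid overflow
    W = np.exp(scores)
    return W / W.sum(axis=1, keepdims=True)

def hardmax_attention_weights(X, A):
    """Weights spread uniformly over the maximizers of each row of X A X^T."""
    scores = X @ A @ X.T
    is_max = np.isclose(scores, scores.max(axis=1, keepdims=True))
    return is_max / is_max.sum(axis=1, keepdims=True)

# Toy sequence of n = 3 tokens in R^2 (illustrative values only).
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, 0.5]])

# Hypothetical rank-one score matrix A = u v^T.
u = np.array([[1.0], [0.0]])
v = np.array([[0.0], [1.0]])
A = u @ v.T

hard = hardmax_attention_weights(X, A)

# Largest entrywise gap between softmax and hardmax weights as beta grows.
betas = (1.0, 10.0, 100.0)
gaps = [np.abs(softmax_attention_weights(X, A, b) - hard).max() for b in betas]

# One attention update with an identity value matrix (an assumption):
# each token moves to the average of the tokens it attends to.
X_hard = hard @ X
```

In this toy example the second token has equal scores against every token, so hardmax averages over all three tokens, and softmax gives the same uniform weights for every `beta`. The other two rows each have a single maximizer, and the softmax weights concentrate on it as `beta` increases.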
