Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models
Benjamin L. Badger, Ethan Roland

TL;DR
The paper introduces Toeplitz MLP Mixers, a low-complexity sequence model that replaces attention with Toeplitz matrix multiplication, achieving efficient training and superior information retention.
Contribution
It presents a novel transformer-like architecture that reduces computational complexity while enhancing input information retention and in-context learning performance.
Findings
TMMs achieve $\text{O}(dn \log n)$ training complexity.
TMMs outperform comparable architectures in information retrieval benchmarks.
Trained Toeplitz layers tend to be more invertible than expected.
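To make the complexity claim concrete, here is a minimal sketch of why a triangular-masked (causal) Toeplitz product can be computed in $\text{O}(n \log n)$ per feature channel. It is not the authors' implementation. A lower-triangular Toeplitz matrix is fully described by its first column, so multiplying by it is a causal convolution, which an FFT can evaluate. The function names, the weight vector `w`, and the pure-Python radix-2 FFT are all assumptions made for illustration. A real model would presumably use an optimized FFT library.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def causal_toeplitz_naive(w, x):
    """Reference O(n^2) product: y[i] = sum over j <= i of w[i - j] * x[j].

    Equivalent to multiplying x by the lower-triangular Toeplitz matrix
    whose first column is w (hypothetical weights for one channel).
    """
    n = len(x)
    return [sum(w[i - j] * x[j] for j in range(i + 1)) for i in range(n)]


def causal_toeplitz_fft(w, x):
    """Same product in O(n log n) via zero-padded FFT convolution."""
    n = len(x)
    size = 1
    while size < 2 * n:  # pad so the circular convolution does not wrap around
        size *= 2
    fw = fft([complex(v) for v in w] + [0j] * (size - n))
    fx = fft([complex(v) for v in x] + [0j] * (size - n))
    prod = [a * b for a, b in zip(fw, fx)]
    y = fft(prod, inverse=True)
    # Keep the first n outputs (the causal part) and apply the 1/size scaling.
    return [v.real / size for v in y[:n]]


if __name__ == "__main__":
    w = [1.0, 2.0, 3.0, 4.0]   # first column of the Toeplitz matrix
    x = [1.0, 0.0, -1.0, 2.0]  # one feature channel over n = 4 positions
    print(causal_toeplitz_naive(w, x))
    print(causal_toeplitz_fft(w, x))
```

Running one such product for each of the $d$ feature channels would give the $\text{O}(dn \log n)$ total quoted above. How the weights are parameterized or shared across channels in the actual architecture is not specified in this summary.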
Abstract
Transformer-based large language models are in some respects limited by the quadratic time and space computational complexity of attention. We introduce the Toeplitz MLP Mixer (TMM), a transformer-like architecture that swaps attention for triangular-masked Toeplitz matrix multiplication over the sequence dimension, resulting in sub-quadratic time and space complexity during training and at inference prefill. Despite the lack of sophisticated input modulation or state maintenance present in other sub-quadratic architectures, TMMs yield greater training efficiency in terms of loss achieved per compute and device memory. We demonstrate that TMMs are capable of retaining more input information, resulting in improved copying ability, which we argue results from a lack of architectural biases. Consistent with higher input information…
