Masked Mixers for Language Generation and Retrieval
Benjamin L. Badger

TL;DR
This paper introduces masked mixers, which replace self-attention with masked convolutions. Compared with transformers, they represent their inputs more accurately and train more efficiently for language modeling and retrieval, especially with small context windows.
Contribution
It presents masked mixers as a novel alternative to transformers and demonstrates more accurate input representation and better retrieval performance, particularly in low-resource settings.
Findings
Masked mixers outperform transformers when trained on small context windows.
Masked mixers achieve better retrieval performance than larger transformer models.
Input representation accuracy correlates with training efficiency and task performance.
Abstract
Attention mechanisms that confer selective focus on a strict subset of input elements are nearly ubiquitous in language models today. We posit that there is a downside to the use of attention: most input information is lost. In support of this idea we observe poor input representation accuracy in transformers and more accurate representation in what we term masked mixers, which replace self-attention with masked convolutions. The masked mixer learns causal language modeling more efficiently than early transformer implementations and even outperforms optimized, current transformers when training on small but not larger context windows. Evidence is presented for the hypothesis that differences in transformer and masked mixer training efficiencies for various tasks are best predicted by input representation accuracy, or equivalently global invertibility. We hypothesize that the…
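The idea of replacing self-attention with a masked convolution can be illustrated with a small NumPy sketch. It assumes the convolution acts as a learned weight matrix over sequence positions, masked to be lower-triangular so that each position only mixes in earlier positions. The function name, the variable names, and the shapes are hypothetical, and this is not the paper's implementation.

```python
import numpy as np

def masked_token_mixing(x, weight):
    """Mix information across sequence positions under a causal mask.

    x:      (n_ctx, d_model) array of token embeddings
    weight: (n_ctx, n_ctx) mixing weights (hypothetical name; learned in practice)
    """
    n_ctx = x.shape[0]
    # Keep only weights on or below the diagonal, so output position t is a
    # weighted sum of input positions 0..t and never depends on later tokens.
    causal_mask = np.tril(np.ones((n_ctx, n_ctx)))
    return (weight * causal_mask) @ x

rng = np.random.default_rng(0)
n_ctx, d_model = 6, 4
x = rng.normal(size=(n_ctx, d_model))
weight = rng.normal(size=(n_ctx, n_ctx))
y = masked_token_mixing(x, weight)

# Causality check: changing the last token leaves all earlier outputs unchanged.
x_perturbed = x.copy()
x_perturbed[-1] += 1.0
y_perturbed = masked_token_mixing(x_perturbed, weight)
causal_ok = np.allclose(y[:-1], y_perturbed[:-1])
```

Unlike attention, the mixing weights in this sketch do not depend on the input, which is why the weight matrix is tied to a fixed context length.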
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems
Methods: Focus
