Masked Mixers for Language Generation and Retrieval

Benjamin L. Badger

arXiv:2409.01482·cs.CL·March 21, 2025

TL;DR

This paper introduces masked mixers, which replace self-attention with masked convolutions. Masked mixers represent their input more accurately than transformers and improve efficiency in language modeling and retrieval, especially for small context windows.

Contribution

The paper presents masked mixers as a novel alternative to transformers and demonstrates superior input representation and retrieval performance, particularly in low-resource settings.

Findings

01. Masked mixers outperform optimized, current transformers when trained on small context windows ($n_{ctx} < 512$), but not on larger ones.

02. Masked mixers achieve better retrieval performance than larger transformer models.

03. Input representation accuracy correlates with training efficiency and task performance.

Abstract

Attention mechanisms that confer selective focus on a strict subset of input elements are nearly ubiquitous in language models today. We posit there to be a downside to the use of attention: most input information is lost. In support of this idea we observe poor input representation accuracy in transformers and more accurate representation in what we term masked mixers, which replace self-attention with masked convolutions. The masked mixer learns causal language modeling more efficiently than early transformer implementations and even outperforms optimized, current transformers when training on small ($n_{ctx} < 512$) but not larger context windows. Evidence is presented for the hypothesis that differences in transformer and masked mixer training efficiencies for various tasks are best predicted by input representation accuracy, or equivalently global invertibility. We hypothesize that the…
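
To make the idea of a masked convolution concrete, here is a minimal pure-Python sketch. It assumes the convolution acts as a learned mixing matrix across token positions whose upper triangle is zeroed, so position t only receives information from positions up to t. The names `causal_mask` and `masked_mix`, the toy sizes, and the random weights are all hypothetical illustrations, not taken from the paper or its repository.

```python
import random


def causal_mask(weights):
    """Zero every entry above the diagonal so position t only sees positions <= t."""
    n = len(weights)
    return [[weights[t][s] if s <= t else 0.0 for s in range(n)] for t in range(n)]


def masked_mix(weights, tokens):
    """Mix token embeddings across the sequence dimension with a masked weight matrix.

    weights: n_ctx x n_ctx list of floats (hypothetical learned mixing weights)
    tokens:  n_ctx x d_model list of floats (one embedding per position)
    """
    masked = causal_mask(weights)
    n, d = len(tokens), len(tokens[0])
    return [
        [sum(masked[t][s] * tokens[s][j] for s in range(n)) for j in range(d)]
        for t in range(n)
    ]


# Toy sizes and random values, chosen only for illustration.
random.seed(0)
n_ctx, d_model = 4, 3
W = [[random.uniform(-1, 1) for _ in range(n_ctx)] for _ in range(n_ctx)]
x = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(n_ctx)]

y = masked_mix(W, x)

# Perturb only the last token; outputs at earlier positions should not change.
x_perturbed = [row[:] for row in x]
x_perturbed[-1] = [v + 1.0 for v in x_perturbed[-1]]
y_perturbed = masked_mix(W, x_perturbed)

unchanged = all(y[t] == y_perturbed[t] for t in range(n_ctx - 1))
print("earlier positions unaffected by a future token:", unchanged)
```

The perturbation check at the end illustrates why such a mask suits causal language modeling: a change to a later token cannot leak into the outputs at earlier positions.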

Code & Models

Repositories

blbadger/maskedmixers (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems

Methods: Focus