Language Modeling using LMUs: 10x Better Data Efficiency or Improved Scaling Compared to Transformers
Narsimha Chilkuri, Eric Hunsberger, Aaron Voelker, Gurshaant Malik, Chris Eliasmith

TL;DR
This paper introduces a Legendre Memory Unit-based model for language modeling that achieves accuracy comparable to transformers with significantly less data and computation, and that improves further when combined with global self-attention.
Contribution
The paper presents a novel sequence processing architecture with better data efficiency and scalability than transformers, and demonstrates its effectiveness in language modeling tasks.
Findings
Achieves same accuracy as transformers with 10x fewer tokens
Improves loss over transformers for the same training data
Adding global self-attention further enhances performance
Abstract
Recent studies have demonstrated that the performance of transformers on the task of language modeling obeys a power-law relationship with model size over six orders of magnitude. While transformers exhibit impressive scaling, their performance hinges on processing large amounts of data, and their computational and memory requirements grow quadratically with sequence length. Motivated by these considerations, we construct a Legendre Memory Unit based model that introduces a general prior for sequence processing and exhibits an O(n) and O(n ln n) (or better) dependency for memory and computation respectively. Over three orders of magnitude, we show that our new architecture attains the same accuracy as transformers with 10x fewer tokens. We also show that for the same amount of training our model improves the loss over transformers about as much as transformers improve over LSTMs.…
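To make the power-law claim in the abstract concrete, here is a minimal sketch of how a relationship of the form loss = a * size^(-alpha) can be fitted as a straight line in log-log space. The function name `fit_power_law`, the prefactor 5.0, and the exponent 0.07 are illustrative assumptions; the data is synthetic and does not come from the paper.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-alpha) in log-log space.

    Hypothetical helper for illustration; returns the pair (a, alpha).
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least squares for the line y = intercept + slope * x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # A power law has slope -alpha and intercept log(a) on log-log axes.
    return math.exp(intercept), -slope


# Synthetic data (assumed values, NOT results from the paper):
# model sizes spanning six orders of magnitude, exact power-law losses.
sizes = [10 ** k for k in range(3, 10)]
losses = [5.0 * s ** -0.07 for s in sizes]

a, alpha = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

On real measurements the points would scatter around the fitted line, and comparing two architectures amounts to comparing the offset and slope of their fitted lines.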
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
