Language Modeling using LMUs: 10x Better Data Efficiency or Improved Scaling Compared to Transformers

Narsimha Chilkuri; Eric Hunsberger; Aaron Voelker; Gurshaant Malik; Chris Eliasmith

arXiv:2110.02402 · cs.LG · October 7, 2021 · 5 cites

Open Access

TL;DR

This paper introduces a Legendre Memory Unit (LMU)-based language model that matches transformer accuracy with 10x fewer training tokens and lower memory and compute requirements, and that improves further when combined with global self-attention.

Contribution

The paper presents a novel sequence processing architecture with better data efficiency and scalability than transformers, and demonstrates its effectiveness in language modeling tasks.

Findings

01 Achieves the same accuracy as transformers with 10x fewer tokens

02 Improves loss over transformers for the same training data

03 Adding global self-attention further enhances performance

Abstract

Recent studies have demonstrated that the performance of transformers on the task of language modeling obeys a power-law relationship with model size over six orders of magnitude. While transformers exhibit impressive scaling, their performance hinges on processing large amounts of data, and their computational and memory requirements grow quadratically with sequence length. Motivated by these considerations, we construct a Legendre Memory Unit based model that introduces a general prior for sequence processing and exhibits an $O(n)$ and $O(n \ln n)$ (or better) dependency for memory and computation respectively. Over three orders of magnitude, we show that our new architecture attains the same accuracy as transformers with 10x fewer tokens. We also show that for the same amount of training our model improves the loss over transformers about as much as transformers improve over LSTMs.…
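
The $O(n \ln n)$ computation figure in the abstract can be illustrated with a small sketch. An LMU memory is a linear time-invariant system, so its state over a whole sequence can be obtained either step by step or as a convolution of the input with the system's impulse response, and the convolution can be evaluated with an FFT. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the (A, B) matrices follow the original LMU definition (Voelker et al., 2019), while the forward-Euler discretization, the function names, and the chosen `order` and `theta` values are hypothetical choices made for brevity.

```python
import numpy as np


def lmu_matrices(order, theta):
    """Continuous-time LMU memory matrices, scaled by 1/theta (window length)."""
    i = np.arange(order)[:, None]
    j = np.arange(order)[None, :]
    A = (2 * i + 1) * np.where(i < j, -1.0, (-1.0) ** (i - j + 1))
    B = (2 * np.arange(order) + 1) * (-1.0) ** np.arange(order)
    return A / theta, B / theta


def euler_discretize(A, B, dt=1.0):
    """Forward-Euler discretization (an assumption chosen for simplicity)."""
    return np.eye(A.shape[0]) + dt * A, dt * B


def memory_recurrent(Ad, Bd, u):
    """Sequential update m_t = Ad m_{t-1} + Bd u_t, one step per input."""
    m = np.zeros(len(Bd))
    states = []
    for u_t in u:
        m = Ad @ m + Bd * u_t
        states.append(m)
    return np.array(states)


def memory_fft(Ad, Bd, u):
    """Same states via FFT convolution with the impulse response H[k] = Ad^k Bd."""
    n = len(u)
    H = np.empty((n, len(Bd)))
    h = Bd.copy()
    for k in range(n):  # depends only on (Ad, Bd), so it can be precomputed once
        H[k] = h
        h = Ad @ h
    size = 2 * n  # zero-pad so the circular convolution equals the linear one
    U = np.fft.rfft(u, n=size)
    Hf = np.fft.rfft(H, n=size, axis=0)
    return np.fft.irfft(Hf * U[:, None], n=size, axis=0)[:n]


# Usage: both routes produce the same memory states for a random input.
rng = np.random.default_rng(0)
u = rng.standard_normal(64)
Ad, Bd = euler_discretize(*lmu_matrices(order=4, theta=32.0))
m_seq = memory_recurrent(Ad, Bd, u)
m_fft = memory_fft(Ad, Bd, u)
```

The stored states take $O(n)$ memory for a fixed memory order, and the FFT route costs $O(n \ln n)$ once the impulse response is available, whereas the recurrent route must process the sequence one step at a time.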

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis