Mega: Moving Average Equipped Gated Attention
Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, Luke Zettlemoyer

TL;DR
Mega is a simple, theoretically grounded, single-head gated attention mechanism equipped with an exponential moving average. It models long sequences efficiently and improves results across a range of sequence modeling tasks.
Contribution
The paper presents Mega, a single-head gated attention mechanism that uses an exponential moving average to add an inductive bias toward position-aware local dependencies. It also proposes a chunked variant with linear time and space complexity for long sequence modeling.
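To make the moving-average idea concrete, here is a minimal sketch of a plain scalar exponential moving average in Python. It is the textbook recurrence only, not the parameterization used in the paper; the function name `ema` and the smoothing factor `alpha` are illustrative assumptions.

```python
def ema(xs, alpha):
    """Textbook exponential moving average (illustrative, not the paper's variant).

    Recurrence: y_t = alpha * x_t + (1 - alpha) * y_{t-1}, starting from y = 0.
    `alpha` is a hypothetical smoothing factor in (0, 1].
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    ys = []
    prev = 0.0  # assumed initial state
    for x in xs:
        # The newest input gets weight alpha; older inputs decay by (1 - alpha) per step.
        prev = alpha * x + (1.0 - alpha) * prev
        ys.append(prev)
    return ys


if __name__ == "__main__":
    print(ema([1.0, 1.0, 1.0], 0.5))  # [0.5, 0.75, 0.875]
```

Because each output weights recent inputs more heavily than distant ones, the recurrence depends on position in a way that plain attention scores do not.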
Findings
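Mega is reported to outperform existing models on benchmarks that include the Long Range Arena, neural machine translation, auto-regressive language modeling, and image and speech classification.
The chunked variant runs in linear time and space with only minimal quality loss.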
Mega demonstrates versatility across language, image, and speech tasks.
Abstract
The design choices in the Transformer attention mechanism, including weak inductive bias and quadratic computational complexity, have limited its application for modeling long sequences. In this paper, we introduce Mega, a simple, theoretically grounded, single-head gated attention mechanism equipped with (exponential) moving average to incorporate inductive bias of position-aware local dependencies into the position-agnostic attention mechanism. We further propose a variant of Mega that offers linear time and space complexity yet yields only minimal quality loss, by efficiently splitting the whole sequence into multiple chunks with fixed length. Extensive experiments on a wide range of sequence modeling benchmarks, including the Long Range Arena, neural machine translation, auto-regressive language modeling, and image and speech classification, show that Mega achieves significant…
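The chunking idea in the abstract can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: it uses plain softmax dot-product attention as a stand-in rather than Mega's gated attention, and the names `attention`, `chunked_attention`, `pair_count`, and `chunk_size` are hypothetical. It shows why restricting attention to fixed-length chunks makes the number of query-key scores grow linearly with sequence length.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Full softmax attention (stand-in, not Mega's gated attention)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def chunked_attention(queries, keys, values, chunk_size):
    """Apply attention independently inside each fixed-length chunk."""
    out = []
    for start in range(0, len(queries), chunk_size):
        end = start + chunk_size
        out.extend(attention(queries[start:end], keys[start:end], values[start:end]))
    return out


def pair_count(n, chunk_size=None):
    """Number of query-key scores computed for a length-n sequence."""
    if chunk_size is None:
        return n * n  # full attention: quadratic in n
    total = 0
    for start in range(0, n, chunk_size):
        length = min(chunk_size, n - start)
        total += length * length  # each chunk is quadratic only in its own length
    return total


if __name__ == "__main__":
    print(pair_count(8), pair_count(8, 2))  # 64 16
```

With a fixed `chunk_size` of c and a length n divisible by c, the chunked count is n * c, which is linear in n, whereas full attention needs n * n scores.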
Code & Models
- 🤗 mnaylor/mega-wikitext-103 (model) · ♡ 2
- 🤗 pszemraj/mega-small-2048-C1024-tk_id-simplewiki-MR50 (model) · 4 dl
- 🤗 BEE-spoke-data/mega-ar-126m-4k (model) · 733 dl · ♡ 4
- 🤗 BEE-spoke-data/mega-encoder-small-16k-v1 (model) · 5 dl · ♡ 4
- 🤗 RichardErkhov/BEE-spoke-data_-_mega-ar-126m-4k-4bits (model) · 2 dl
- 🤗 RichardErkhov/BEE-spoke-data_-_mega-ar-126m-4k-8bits (model) · 4 dl
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Softmax · Dropout · Label Smoothing · Residual Connection
