Mega: Moving Average Equipped Gated Attention

Xuezhe Ma; Chunting Zhou; Xiang Kong; Junxian He; Liangke Gui; Graham Neubig; Jonathan May; Luke Zettlemoyer

arXiv:2209.10655 · cs.LG · January 31, 2023 · 36 cites

Open Access · 5 Repos · 6 Models · 1 Video

TL;DR

Mega is a simple, theoretically grounded, single-head gated attention mechanism equipped with an exponential moving average. It models long sequences efficiently and improves performance across a wide range of sequence modeling tasks.

Contribution

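The paper presents Mega, a single-head gated attention mechanism that uses an exponential moving average to give position-agnostic attention an inductive bias toward position-aware local dependencies. It also proposes a variant with linear time and space complexity, obtained by splitting the sequence into fixed-length chunks, for long sequence modeling.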
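As a rough illustration of the moving-average component, the sketch below computes a classical exponential moving average over a scalar sequence. The function name `ema`, the scalar inputs, the smoothing factor `alpha`, and the zero initial state are assumptions made for illustration. This is the textbook recurrence only, not the paper's exact formulation.

```python
def ema(xs, alpha):
    """Classical exponential moving average (illustrative sketch).

    Recurrence: y_t = alpha * x_t + (1 - alpha) * y_{t-1}, with y_0 = 0.
    `alpha` in (0, 1] is a hypothetical smoothing factor: larger values
    weight the current input more, smaller values weight history more.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    ys = []
    prev = 0.0  # assumed initial state
    for x in xs:
        # Each output mixes the current input with the decayed history,
        # so nearby positions contribute more than distant ones.
        prev = alpha * x + (1.0 - alpha) * prev
        ys.append(prev)
    return ys
```

For example, `ema([1.0, 1.0, 1.0], 0.5)` returns `[0.5, 0.75, 0.875]`: the weight of an input decays exponentially with its distance from the current position, which is the kind of local, position-aware bias the contribution describes.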

Findings

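01. Mega outperforms existing models on benchmarks including the Long Range Arena, neural machine translation, auto-regressive language modeling, and image and speech classification.

02. The linear variant, which splits the sequence into fixed-length chunks, has linear time and space complexity with only minimal quality loss.

03. Mega demonstrates versatility across language, image, and speech tasks.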

Abstract

The design choices in the Transformer attention mechanism, including weak inductive bias and quadratic computational complexity, have limited its application for modeling long sequences. In this paper, we introduce Mega, a simple, theoretically grounded, single-head gated attention mechanism equipped with (exponential) moving average to incorporate inductive bias of position-aware local dependencies into the position-agnostic attention mechanism. We further propose a variant of Mega that offers linear time and space complexity yet yields only minimal quality loss, by efficiently splitting the whole sequence into multiple chunks with fixed length. Extensive experiments on a wide range of sequence modeling benchmarks, including the Long Range Arena, neural machine translation, auto-regressive language modeling, and image and speech classification, show that Mega achieves significant…
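The chunking idea in the abstract can be sketched as follows: attention is computed only within fixed-length chunks, so the number of query-key scores grows linearly with sequence length for a fixed chunk size. This is a minimal sketch under stated assumptions, not the paper's implementation: it uses plain softmax self-attention with queries, keys, and values all equal to the input, omits Mega's gating and moving-average components, and the names `attend`, `chunked_attention`, and `score_count` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def chunked_attention(xs, chunk_size):
    """Apply self-attention independently inside each fixed-length chunk.

    Assumption for illustration: queries = keys = values = the chunk itself.
    """
    out = []
    for start in range(0, len(xs), chunk_size):
        chunk = xs[start:start + chunk_size]
        out.extend(attend(chunk, chunk, chunk))
    return out


def score_count(n, chunk_size):
    """Number of query-key scores computed by chunked_attention."""
    full, rem = divmod(n, chunk_size)
    return full * chunk_size ** 2 + rem ** 2
```

With a sequence of length 8, `score_count(8, 8)` gives 64 scores for full attention, while `score_count(8, 2)` gives 16. In general the cost is about `n * chunk_size` rather than `n ** 2`, which is linear in `n` when the chunk length is fixed.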

Code & Models

Videos

Mega: Moving Average Equipped Gated Attention · SlidesLive

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Softmax · Dropout · Label Smoothing · Residual Connection