Only 5% Attention Is All You Need: Efficient Long-range Document-level Neural Machine Translation
Zihan Liu, Zewei Sun, Shanbo Cheng, Shujian Huang, Mingxuan Wang

TL;DR
This paper introduces a lightweight attention mechanism that reduces the number of tokens attended to in Transformer-based document translation, achieving significant speedups and sparsity without sacrificing translation quality.
Contribution
It proposes a novel sparse attention method that maintains translation performance while reducing computational cost by attending to only 5% of tokens.
Findings
Achieves up to 95% sparsity in attention
Saves 93% of attention computation cost
Maintains translation quality while gaining a 20% speedup
Abstract
Document-level Neural Machine Translation (DocNMT) has been proven crucial for handling discourse phenomena by introducing document-level context information. One of the most important directions is to input the whole document directly to the standard Transformer model. In this case, efficiency becomes a critical concern due to the quadratic complexity of the attention module. Existing studies either focus on the encoder part, which cannot be deployed on sequence-to-sequence generation tasks, e.g., Machine Translation (MT), or suffer from a significant performance drop. In this work, we keep the translation performance while gaining a 20% speed-up by introducing an extra selection layer based on lightweight attention that selects a small portion of tokens to be attended. It takes advantage of the original attention to ensure performance and dimension reduction to accelerate inference.…
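The selection idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `sparse_topk_attention`, the projection matrices `P_q` and `P_k`, the per-query top-k rule, and the single-head layout are all assumptions made for illustration. It scores keys cheaply in a reduced dimension, keeps roughly 5% of them per query, and then runs ordinary scaled dot-product attention over the kept tokens only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sparse_topk_attention(Q, K, V, P_q, P_k, keep_ratio=0.05):
    """Hypothetical single-head sketch of selection followed by attention.

    Q: (n_q, d) queries; K, V: (n_k, d) keys and values.
    P_q, P_k: (d, r) projections with r << d (assumed, for cheap scoring).
    """
    n_k, d = K.shape
    k = max(1, int(round(keep_ratio * n_k)))  # number of keys kept per query

    # Lightweight selection scores computed in the reduced dimension r.
    sel_scores = (Q @ P_q) @ (K @ P_k).T  # (n_q, n_k)

    # Indices of the k highest-scoring keys for each query (unordered).
    top_idx = np.argpartition(-sel_scores, k - 1, axis=-1)[:, :k]  # (n_q, k)

    # Full-dimension attention restricted to the selected keys and values.
    K_sel = K[top_idx]  # (n_q, k, d)
    V_sel = V[top_idx]  # (n_q, k, d)
    scores = np.einsum("qd,qkd->qk", Q, K_sel) / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    out = np.einsum("qk,qkd->qd", weights, V_sel)
    return out, top_idx
```

With `keep_ratio=1.0` the sketch reduces to dense attention, which gives a quick sanity check. The full-dimension score computation shrinks from `n_q * n_k` dot products to `n_q * k`, at the price of the extra low-dimensional scoring pass.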
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Focus · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings · Dense Connections · Linear Layer
