Attention Sinks in Massively Multilingual Neural Machine Translation: Discovery, Analysis, and Mitigation
Hillary Mutisya, John Mugane

TL;DR
This paper identifies "attention sinks", a pervasive artifact in multilingual NMT cross-attention that stems from vocabulary design. The artifact distorts attention analysis, but filtering out non-content tokens and renormalizing the remaining attention corrects the distortion and improves interpretability.
Contribution
It identifies and analyzes the 'attention sink' artifact in NMT cross-attention, demonstrating its universality and proposing a filtering method to correct analysis.
Findings
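Non-content tokens acting as attention sinks capture 83-91% of total cross-attention mass in NLLB-200 (600M).
Filtering out non-content tokens and renormalizing raises measured content-level similarity from 36.7% (raw) to 70.7% (filtered).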
Corrected analysis reveals linguistic signals and language clustering in attention patterns.
Abstract
Cross-attention patterns in neural machine translation (NMT) are widely used to study how multilingual models align linguistic structure. We report a systematic artifact in cross-attention analysis of NLLB-200 (600M): non-content tokens - primarily end-of-sequence tokens, language tags, and punctuation - capture 83 percent to 91 percent of total cross-attention mass. We term these "attention sinks," extending findings from LLMs [Xiao et al., 2023] to NMT cross-attention and identifying a causal mechanism rooted in vocabulary design rather than position bias. This artifact causes raw metrics to underestimate content-level similarity by nearly half (36.7 percent raw vs. 70.7 percent filtered), rendering uncorrected analyses unreliable. To address this, we validate a content-only filtering methodology that removes non-content tokens and renormalizes the distribution. Applying this to 1,000…
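The filtering step described in the abstract (remove non-content tokens, then renormalize) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sink_mass` and `content_only`, the toy tokens, the attention weights, and the `NON_CONTENT` set are all hypothetical, chosen only to show the arithmetic on a single attention row.

```python
from typing import List, Sequence, Set


def sink_mass(attn_row: Sequence[float], tokens: Sequence[str],
              non_content: Set[str]) -> float:
    """Fraction of one attention row's mass that lands on non-content tokens."""
    total = sum(attn_row)
    sink = sum(w for w, t in zip(attn_row, tokens) if t in non_content)
    return sink / total if total > 0 else 0.0


def content_only(attn_row: Sequence[float], tokens: Sequence[str],
                 non_content: Set[str]) -> List[float]:
    """Zero out non-content positions and renormalize the rest to sum to 1."""
    kept = [0.0 if t in non_content else w for w, t in zip(attn_row, tokens)]
    total = sum(kept)
    if total == 0:
        # No content mass at all: return the all-zero row rather than divide by zero.
        return kept
    return [w / total for w in kept]


# Toy source sentence and one made-up cross-attention row (assumed values,
# not taken from the paper). The non-content set is also an assumption:
# a language tag, punctuation, and the end-of-sequence token.
tokens = ["eng_Latn", "The", "cat", "sleeps", ".", "</s>"]
attn = [0.30, 0.05, 0.06, 0.04, 0.10, 0.45]
NON_CONTENT = {"eng_Latn", ".", "</s>"}

mass = sink_mass(attn, tokens, NON_CONTENT)
filtered = content_only(attn, tokens, NON_CONTENT)

print(f"sink mass: {mass:.2f}")
print("filtered row:", [round(w, 3) for w in filtered])
```

In this toy row most of the mass sits on the language tag and end-of-sequence token, so any similarity metric computed on the raw row is dominated by positions that carry no lexical content. After filtering, the three content tokens share the whole distribution, which is the quantity the corrected analysis compares.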
