Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs
Anand, Umberto Cappellazzo, Stavros Petridis, Maja Pantic

TL;DR
This paper investigates attention sinks and massive activations in LLM-based multimodal speech recognition models, traces where they originate, and proposes a decorrelation loss that mitigates them and improves recognition accuracy under challenging conditions.
Contribution
This is the first study to analyze attention sinks and massive activations in LLM-based multimodal speech recognition, and it introduces a decorrelation loss to address both phenomena.
Findings
Attention sinks occur at the BOS token and at intermediate low-semantic tokens across ASR, VSR, and AVSR.
Massive activations originate in the MLP layers and sit at the same feature indices across all sink tokens.
The proposed decorrelation loss reduces attention sinks and improves word error rate (WER), especially at high downsampling rates.
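The summary above does not give the form of the decorrelation loss. As a minimal sketch only, assuming the loss penalizes similarity between the BOS hidden state and the hidden states of all other tokens, one plausible version is the mean squared cosine similarity below. The function name `bos_decorrelation_loss`, the choice of squared cosine, and the convention that row 0 is BOS are all assumptions made for illustration, not the authors' definition.

```python
import numpy as np


def bos_decorrelation_loss(hidden, eps=1e-8):
    """Mean squared cosine similarity between BOS and every other token.

    hidden: array of shape (seq_len, d_model); row 0 is ASSUMED to be BOS.
    This is a hypothetical form of the loss, not the paper's definition.
    """
    hidden = np.asarray(hidden, dtype=np.float64)
    bos, rest = hidden[0], hidden[1:]
    # Cosine similarity of each non-BOS token with the BOS token.
    cos = rest @ bos / (np.linalg.norm(rest, axis=1) * np.linalg.norm(bos) + eps)
    # Squaring penalizes alignment and anti-alignment alike.
    return float(np.mean(cos ** 2))


# Tokens orthogonal to BOS incur no penalty; tokens parallel to BOS incur ~1.
orthogonal = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]])
parallel = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
loss_orthogonal = bos_decorrelation_loss(orthogonal)
loss_parallel = bos_decorrelation_loss(parallel)
```

In training, a term like this would be computed in a differentiable framework and added to the recognition loss with a weighting coefficient; NumPy is used here only to keep the arithmetic easy to inspect.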
Abstract
Large language models (LLMs) have recently advanced auditory speech recognition (ASR), visual speech recognition (VSR), and audio-visual speech recognition (AVSR). However, understanding of their internal dynamics under fine-tuning remains limited. In natural language processing, recent work has revealed attention sinks, tokens that attract disproportionately high attention, and associated massive activations in which some features of sink tokens exhibit huge activation in LLMs. In this work, we are the first to study these phenomena in multimodal speech recognition. Through a detailed analysis of audio-visual LLMs, we identify attention sinks and massive activations not only at the BOS token but also at intermediate low-semantic tokens across ASR, VSR, and AVSR. We show that massive activations originate in the MLP layers and correspond to fixed feature indices across all sink tokens.…
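The two phenomena defined in the abstract can be made concrete with a small diagnostic sketch. It assumes access to one head's attention map and one layer's hidden states as plain arrays; the function names and both `ratio` thresholds are hypothetical choices for illustration, not values taken from the paper.

```python
import numpy as np


def find_attention_sinks(attn, ratio=4.0):
    """Return key positions that receive disproportionately high attention.

    attn: (num_queries, num_keys) attention map whose rows sum to 1.
    A key counts as a sink when its mean received attention exceeds
    `ratio` times the uniform share 1 / num_keys (hypothetical threshold).
    """
    attn = np.asarray(attn, dtype=np.float64)
    received = attn.mean(axis=0)  # average attention each key receives
    return np.flatnonzero(received > ratio / attn.shape[1])


def find_massive_features(hidden, token_idx, ratio=100.0):
    """Return feature indices of one token with outsized magnitude.

    hidden: (seq_len, d_model) hidden states from one layer.
    A feature counts as massive when its magnitude exceeds `ratio` times
    the median magnitude over the whole layer (hypothetical threshold).
    """
    mags = np.abs(np.asarray(hidden, dtype=np.float64))
    return np.flatnonzero(mags[token_idx] > ratio * np.median(mags))


# Toy data: every query puts 0.9 of its attention on position 0,
# and position 0 carries one very large feature at index 3.
attn = np.full((8, 8), 0.1 / 7)
attn[:, 0] = 0.9
hidden = np.ones((8, 16))
hidden[0, 3] = 500.0

sinks = find_attention_sinks(attn)
massive = find_massive_features(hidden, token_idx=0)
```

Running `find_massive_features` on each detected sink and comparing the returned indices is one way to check the claim that massive activations share fixed feature indices across sink tokens. With causal masking, early keys are visible to more queries than late ones, so a per-key average over only the unmasked queries may be a fairer score than the plain column mean used here.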
