On the Nature of Attention Sink that Shapes Decoding Strategy in Omni-LLMs
Suho Yoo, Youngjoon Jang, Joon Son Chung

TL;DR
This paper analyses attention sinks in Omni-LLMs, shows that they play functional roles, and proposes OutRo, an inference-time method that improves reasoning on video QA tasks by aligning token representations with minimal decoding overhead.
Contribution
The paper uncovers the functional significance of attention sinks and introduces OutRo, an approach that enhances reasoning in Omni-LLMs without extra training or multiple passes.
Findings
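High sink attention does not solely indicate head redundancy; sink value representations appear to play additional functional roles.
The sink value vector acts as a shared bias added to every token's output, a global signal that organises token representations.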
OutRo improves performance on seven video QA benchmarks with minimal decoding overhead.
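The second finding can be made concrete with a small numerical sketch. It is not the paper's code: it is a toy single-head attention (no causal mask) in which token 0 is forced to be a sink, and the function name `split_sink_contribution` and all shapes are illustrative assumptions. The attention output of each token is split into a term contributed by the sink's value vector and a term contributed by every other token.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def split_sink_contribution(attn, values, sink_idx=0):
    """Split attn @ values into a sink term and a non-sink term.

    attn:   (T, T) attention weights, each row sums to 1
    values: (T, d) value vectors
    """
    sink_weight = attn[:, sink_idx:sink_idx + 1]   # (T, 1) weight each query puts on the sink
    sink_term = sink_weight * values[sink_idx]     # (T, d) same direction for every query
    rest_term = attn @ values - sink_term          # (T, d) contribution of all other tokens
    return sink_term, rest_term

# Toy setup (hypothetical sizes): 6 tokens, 4-dimensional values.
rng = np.random.default_rng(0)
T, d = 6, 4
scores = rng.normal(size=(T, T))
scores[:, 0] += 5.0                                # assumption: token 0 behaves as a sink
attn = softmax(scores)
values = rng.normal(size=(T, d))

sink_term, rest_term = split_sink_contribution(attn, values)
output = attn @ values
```

Every row of `sink_term` is a scaled copy of the same vector `values[0]`, so when most queries place similar weight on the sink, that term behaves like a bias shared across tokens, while `rest_term` carries the token-specific content.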
Abstract
The goal of this paper is to strengthen the reasoning of Omnimodal Large Language Models (Omni-LLMs) at inference time, without additional training. These models jointly process video, audio, and text, and given the large number of tokens they consume, how attention is routed across them is central to their behaviour. We focus specifically on attention sinks, tokens that absorb a disproportionate share of attention mass regardless of their semantic content, to understand how this routing unfolds. To this end, we conduct a systematic analysis of sink behaviour in Omni-LLMs. Our analysis yields two key findings: (i) high sink attention does not solely indicate head redundancy, suggesting that sink value representations play additional functional roles; (ii) the sink value vector acts as a shared bias added to every token's output, serving as a global signal that organises the…
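As an illustration of what "a disproportionate share of attention mass" means in practice, the sketch below averages the attention each key position receives and flags positions that far exceed a uniform share. The function names, the threshold `ratio`, and the hand-built attention tensor are assumptions for illustration only; the paper's own sink criterion is not given in this summary.

```python
import numpy as np

def attention_mass_per_key(attn):
    """Mean attention received by each key position.

    attn: (heads, T, T) attention weights, each row sums to 1.
    Returns an array of shape (T,) that sums to 1.
    """
    return attn.mean(axis=(0, 1))

def find_sinks(attn, ratio=4.0):
    """Flag key positions whose mean mass exceeds `ratio` times the uniform share 1/T.

    `ratio` is a hypothetical threshold chosen for this toy example.
    """
    mass = attention_mass_per_key(attn)
    uniform = 1.0 / attn.shape[-1]
    return np.flatnonzero(mass > ratio * uniform)

# Hand-built toy tensor: 2 heads, 8 tokens; every query puts 0.65 on key 0
# and 0.05 on each of the remaining 7 keys (0.65 + 7 * 0.05 = 1.0).
attn = np.full((2, 8, 8), 0.05)
attn[:, :, 0] = 0.65

sinks = find_sinks(attn)
```

In this toy tensor only key 0 crosses the threshold (0.65 against 4 × 0.125 = 0.5), regardless of what that token's content would be.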
