When Sinks Help or Hurt: Unified Framework for Attention Sink in Large Vision-Language Models
Jiho Choi, Jaemin Kim, Sanghwan Kim, Seunghoon Hong, Jin-Hwi Park

TL;DR
This paper investigates the role of attention sinks in large vision-language models, categorizing them, analyzing their effects, and proposing a dynamic gating method to improve model performance.
Contribution
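The paper introduces a unified framework for understanding attention sinks in LVLMs, categorizes visual sinks into ViT-emerged sinks (V-sinks) and LLM-emerged sinks (L-sinks), and proposes Layer-wise Sink Gating, a dynamic gating method intended to balance the global priors carried by sinks against local visual evidence.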
Findings
Attention sinks encode global scene priors but can suppress local visual evidence.
Modulating sinks at specific layers significantly impacts downstream performance.
Layer-wise Sink Gating improves multimodal benchmark results without task-specific training.
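To make the idea of modulating sinks at a given layer concrete, here is a minimal sketch. It is not the paper's Layer-wise Sink Gating, whose details are not given in this summary. It assumes a hypothetical per-layer scalar gate in [0, 1] that scales the attention a query places on known sink tokens and then renormalizes the row. The names `gate_sinks` and `layer_gates`, and the gate values, are illustrative assumptions.

```python
from typing import Dict, List, Sequence


def gate_sinks(attn_row: Sequence[float],
               sink_idx: Sequence[int],
               gate: float) -> List[float]:
    """Scale attention on sink keys by `gate`, then renormalize to sum to 1.

    Hypothetical illustration only: gate=1.0 leaves the row unchanged,
    gate=0.0 removes all attention on the sink tokens.
    """
    sink_set = set(sink_idx)
    scaled = [w * gate if k in sink_set else w for k, w in enumerate(attn_row)]
    total = sum(scaled)
    if total == 0.0:
        # Degenerate case: everything was gated away, so keep the original row.
        return list(attn_row)
    return [w / total for w in scaled]


# Assumed per-layer gate values (not taken from the paper).
layer_gates: Dict[int, float] = {0: 1.0, 1: 0.5}

row = [0.6, 0.2, 0.2]          # one query's attention over three keys
gated = gate_sinks(row, sink_idx=[0], gate=layer_gates[1])
print(gated)
```

With the gate at 0.5, the sink's share of attention drops and the remaining tokens gain proportionally, which is one simple way to trade global context for local evidence at a chosen layer.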
Abstract
Attention sinks are defined as tokens that attract disproportionate attention. While these have been studied in single-modality transformers, their cross-modal impact in Large Vision-Language Models (LVLMs) remains largely unexplored: are they redundant artifacts or essential global priors? This paper first categorizes visual sinks into two distinct categories: ViT-emerged sinks (V-sinks), which propagate from the vision encoder, and LLM-emerged sinks (L-sinks), which arise within deep LLM layers. Based on the new definition, our analysis reveals a fundamental performance trade-off: while sinks effectively encode global scene-level priors, their dominance can suppress the fine-grained visual evidence required for local perception. Furthermore, we identify specific functional layers where modulating these sinks most significantly impacts downstream performance. To leverage these insights,…
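The definition of a sink as a token that attracts disproportionate attention can be illustrated with a small sketch. It assumes a single-head attention matrix given as row-stochastic lists (rows are queries, columns are keys) and flags a key as a sink when the mean attention it receives exceeds a chosen multiple of the uniform share. The function name `find_sink_tokens` and the `ratio` threshold are assumptions for illustration, not the paper's criterion.

```python
from typing import List


def find_sink_tokens(attn: List[List[float]], ratio: float = 3.0) -> List[int]:
    """Return key indices whose mean received attention exceeds
    `ratio` times the uniform share 1 / num_keys (an assumed criterion)."""
    num_queries = len(attn)
    num_keys = len(attn[0])
    # Mean attention each key receives, averaged over all queries.
    received = [sum(row[k] for row in attn) / num_queries
                for k in range(num_keys)]
    uniform_share = 1.0 / num_keys
    return [k for k, r in enumerate(received) if r > ratio * uniform_share]


# Toy attention map: every query puts most of its mass on key 0.
attn = [
    [0.70, 0.10, 0.10, 0.10],
    [0.85, 0.05, 0.05, 0.05],
    [0.80, 0.05, 0.10, 0.05],
    [0.90, 0.04, 0.03, 0.03],
]
sinks = find_sink_tokens(attn, ratio=3.0)
print(sinks)
```

In this toy map only key 0 is flagged. Separating V-sinks from L-sinks would additionally require tracking where such tokens first emerge (in the vision encoder or in the LLM layers), which this sketch does not attempt.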
