SAGE: Sink-Aware Grounded Decoding for Multimodal Hallucination Mitigation

Tripti Shukla; Zsolt Kira

arXiv:2603.27898 · cs.CV · March 31, 2026

TL;DR

SAGE is a decoding framework that reduces hallucinations in vision-language models by adaptively modulating self-attention based on sink tokens, improving grounding and content accuracy.

Contribution

SAGE introduces a sink-aware decoding method that dynamically adjusts attention during generation to mitigate hallucinations, with no retraining or architecture changes required.

Findings

01. Achieves an average improvement of 10.65% on MSCOCO

02. Achieves an average improvement of 7.19% on AMBER

03. Consistently outperforms existing decoding strategies in hallucination reduction

Abstract

Large vision-language models (VLMs) frequently suffer from hallucinations, generating content that is inconsistent with visual inputs. Existing methods typically address this problem through post-hoc filtering, additional training objectives, or external verification, but they do not intervene during the decoding process when hallucinations arise. In this work, we introduce SAGE, a Sink-Aware Grounded Decoding framework that mitigates hallucinations by dynamically modulating self-attention during generation. Hallucinations are strongly correlated with attention sink tokens - punctuation or function tokens that accumulate disproportionate attention despite carrying limited semantic content. SAGE leverages these tokens as anchors to monitor grounding reliability in real time. At each sink trigger, the method extracts semantic concepts from the generated sequence, estimates their visual…
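The abstract characterizes sink tokens as punctuation or function tokens that accumulate disproportionate attention. A minimal sketch of that idea follows, assuming a single causal attention matrix whose rows are query positions and whose columns are key positions. The function names, the `ratio` threshold, and the candidate token set are hypothetical choices made for illustration. The excerpt does not specify SAGE's actual trigger criterion, so this should not be read as the authors' implementation.

```python
from typing import List, Sequence

# Hypothetical candidate set of punctuation/function tokens (an assumption,
# not taken from the paper).
DEFAULT_CANDIDATES = frozenset({".", ",", ";", ":", "the", "a", "of", "and"})


def attention_received(attn: Sequence[Sequence[float]]) -> List[float]:
    """Mean attention each key position receives from the queries that can see it.

    Assumes causal masking: key k is visible only to queries q >= k.
    """
    n = len(attn)
    received = []
    for k in range(n):
        visible = [attn[q][k] for q in range(k, n)]
        received.append(sum(visible) / len(visible))
    return received


def find_sink_positions(
    attn: Sequence[Sequence[float]],
    tokens: Sequence[str],
    ratio: float = 1.5,  # assumed threshold: multiple of the mean received attention
    candidates: frozenset = DEFAULT_CANDIDATES,
) -> List[int]:
    """Flag candidate tokens whose received attention exceeds ratio * mean."""
    received = attention_received(attn)
    threshold = ratio * sum(received) / len(received)
    return [
        i
        for i, (tok, score) in enumerate(zip(tokens, received))
        if tok.lower() in candidates and score > threshold
    ]


# Toy example: each row is one query's attention distribution (sums to 1).
tokens = ["Two", "dogs", ".", "They", "run"]
attn = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.2, 0.2, 0.6, 0.0, 0.0],
    [0.1, 0.1, 0.7, 0.1, 0.0],
    [0.1, 0.1, 0.6, 0.1, 0.1],
]
sinks = find_sink_positions(attn, tokens)  # flags position 2, the period
```

In this toy matrix the period draws most of the attention from later tokens despite carrying little meaning, which is the pattern the abstract associates with sink tokens. A decoding loop could treat each flagged position as a point at which to check grounding, though how SAGE does so is not covered by the text above.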
