See Fair, Speak Truth: Equitable Attention Improves Grounding and Reduces Hallucination in Vision-Language Alignment

Mohammad Anas Azeez; Ankan Deria; Zohaib Hasan Siddiqui; Adinath Madhavrao Dukre; Rafiq Ali; Sara Atito; Yutong Xie; Imran Razzak

arXiv:2604.09749 · cs.CV · April 14, 2026

TL;DR

This paper introduces DOP-OBC, a decoding strategy that promotes equitable attention in multimodal models, reducing hallucinations and improving grounding without retraining.

Contribution

The paper presents a training-free, architecture-agnostic decoding method that balances attention across all objects in a scene, making multimodal model outputs more faithful to the visual input.

Findings

01. Reduces object hallucination on the CHAIR and POPE benchmarks.

02. Improves captioning quality in terms of correctness, consistency, and detail.

03. Works across image and video multimodal models without retraining.
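For readers unfamiliar with CHAIR, the sketch below shows how its two scores are commonly computed: the instance-level score is the fraction of mentioned objects that are not in the image, and the sentence-level score is the fraction of captions containing at least one such object. The function name `chair_scores` and the toy inputs are hypothetical, and the sketch assumes object mentions have already been extracted from each caption; it is not the evaluation code used in the paper.

```python
def chair_scores(mentioned_per_caption, truth_per_image):
    """Return (instance-level, sentence-level) CHAIR scores.

    mentioned_per_caption: list of lists of object names mentioned in each caption.
    truth_per_image: list of sets of object names actually present in each image.
    Both inputs are assumed to be aligned and pre-extracted (hypothetical format).
    """
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0

    for mentioned, truth in zip(mentioned_per_caption, truth_per_image):
        # An object counts as hallucinated if it is mentioned but not in the image.
        hallucinated = [obj for obj in mentioned if obj not in truth]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        if hallucinated:
            captions_with_hallucination += 1

    n_captions = len(mentioned_per_caption)
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = captions_with_hallucination / n_captions if n_captions else 0.0
    return chair_i, chair_s


# Toy example: one caption hallucinates "frisbee", the other is fully grounded.
mentions = [["dog", "frisbee"], ["cat", "sofa", "lamp"]]
truth = [{"dog", "grass"}, {"cat", "sofa", "lamp"}]
chair_i, chair_s = chair_scores(mentions, truth)
```

Lower is better for both scores, so a decoding strategy that reduces hallucination should lower them relative to the baseline decoder.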

Abstract

Multimodal large language models (MLLMs) frequently hallucinate objects that are absent from the visual input, often because attention during decoding is disproportionately drawn to visually dominant or frequently occurring content. We observe that this inequity in attention allocation is a root cause of object hallucination: when rare, small, or contextually peripheral objects receive insufficient attention, the model fails to ground its generation in the full visual scene. We argue that every object in an image, regardless of its size, frequency or visual salience, deserves equal representational opportunity during decoding. To this end, we propose DOP-OBC, a training-free and architecture-agnostic decoding strategy built on the principle of equitable attention. Two complementary object-aware signals work in tandem: a Dominant Object Penalty (DOP) that softly suppresses attention…
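The abstract is truncated here, so the exact formulation of the penalty is not available on this page. As a rough illustration only of what "softly suppressing" attention on dominant objects could look like, the sketch below shrinks the attention of any object whose total attention mass exceeds an equal share, then renormalizes. The function name `penalize_dominant_attention`, the `strength` parameter, the equal-share threshold, and the per-token object labels are all assumptions made for this example, not the paper's method.

```python
def penalize_dominant_attention(attn, object_ids, strength=0.5):
    """Softly down-weight attention on over-attended objects (illustrative only).

    attn: list of non-negative attention weights over visual tokens, summing to 1.
    object_ids: object label for each visual token (assumed to be given).
    strength: in [0, 1]; 0 leaves attention unchanged (hypothetical parameter).
    """
    # Total attention mass currently assigned to each object.
    mass = {}
    for weight, obj in zip(attn, object_ids):
        mass[obj] = mass.get(obj, 0.0) + weight

    # Assumption: an "equitable" allocation gives every object the same share.
    fair_share = 1.0 / len(mass)

    scaled = []
    for weight, obj in zip(attn, object_ids):
        excess = max(0.0, mass[obj] - fair_share)
        if excess > 0.0:
            # Remove a fraction of the object's excess mass, spread over its tokens.
            weight = weight * (1.0 - strength * excess / mass[obj])
        scaled.append(weight)

    # Renormalize so the result is again a distribution over visual tokens.
    total = sum(scaled)
    return [weight / total for weight in scaled]


# Toy example: a large "car" dominates; "bird" and "sign" are peripheral.
attn = [0.4, 0.4, 0.1, 0.1]
object_ids = ["car", "car", "bird", "sign"]
balanced = penalize_dominant_attention(attn, object_ids)
```

In this toy case the car's share of attention drops while the bird and the sign each gain a little, which is the qualitative behaviour the abstract attributes to equitable attention. The penalty is soft in the sense that dominant objects keep most of their attention rather than being masked out.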
