TL;DR
The paper introduces First Logit Boosting (FLB), a training-free method that reduces object hallucination in large vision-language models by stabilizing visual grounding during text generation.
Contribution
FLB is a simple, training-free technique that mitigates both the long-term decay of visual grounding and the resulting object hallucinations in LVLMs, with minimal inference overhead.
Findings
FLB significantly reduces object hallucination across various tasks and models.
FLB maintains visual grounding throughout generation, counteracting the long-term decay in which language priors come to dominate.
FLB adds negligible inference overhead, making it suitable for real-time systems.
Abstract
Recent Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across various multimodal tasks that require understanding both visual and linguistic inputs. However, object hallucination -- the generation of nonexistent objects in answers -- remains a persistent challenge. Although several approaches such as retraining and external grounding methods have been proposed to mitigate this issue, they still suffer from high data costs or structural complexity. Training-free methods such as Contrastive Decoding (CD) are more cost-effective, avoiding additional training or external models, but still suffer from long-term decay, where visual grounding weakens and language priors dominate as the generation progresses. In this paper, we propose First Logit Boosting (FLB), a simple yet effective training-free technique designed to alleviate long-term decay in LVLMs. FLB…
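For context on the Contrastive Decoding (CD) baseline the abstract mentions, here is a minimal sketch of the usual contrastive-logit combination, in which logits conditioned on the real image are contrasted with logits conditioned on a distorted or missing image. This is not FLB, whose mechanism is cut off in the abstract above. The function names, the weight `alpha`, and the toy logit values are illustrative assumptions, not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(logits_with_image, logits_distorted, alpha=1.0):
    """Combine two logit vectors as (1 + alpha) * original - alpha * distorted.

    Tokens that score highly even without reliable visual input (language
    priors) are pushed down; tokens that depend on the image are pushed up.
    `alpha` is a hypothetical contrast weight chosen for illustration.
    """
    return [
        (1 + alpha) * a - alpha * b
        for a, b in zip(logits_with_image, logits_distorted)
    ]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy three-token vocabulary with made-up logits (assumption, not real data).
vocab = ["dog", "frisbee", "bench"]
with_image = [2.0, 1.8, 1.9]  # model sees the real image
distorted = [2.0, 0.5, 1.9]   # model sees a distorted image

cd = contrastive_logits(with_image, distorted, alpha=1.0)
plain_choice = vocab[argmax(with_image)]
cd_choice = vocab[argmax(cd)]
cd_probs = softmax(cd)
```

In this toy case, "frisbee" is the only token whose score drops when the image is distorted, so the contrast raises it above the prior-driven candidates. The abstract's point is that this kind of correction still weakens over long generations, which is the decay FLB targets.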
