NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors
Lingfeng Ren, Weihao Yu, Runpeng Yu, Xinchao Wang

TL;DR
This paper identifies that object hallucinations in large vision-language models mainly stem from language priors and introduces a training-free method, NoLan, to dynamically suppress these priors, significantly reducing hallucinations.
Contribution
The paper systematically analyzes the sources of hallucinations in LVLMs and proposes NoLan, a simple, training-free framework that effectively mitigates object hallucinations by suppressing language priors.
Findings
NoLan reduces object hallucinations across various LVLMs.
NoLan yields significant accuracy improvements on the POPE benchmark for LLaVA-1.5 7B and Qwen-VL 7B.
Object hallucinations are mainly caused by language decoder priors.
Abstract
Object hallucination is a critical issue in Large Vision-Language Models (LVLMs), where outputs include objects that do not appear in the input image. A natural question arises from this phenomenon: Which component of the LVLM pipeline primarily contributes to object hallucinations? The vision encoder to perceive visual information, or the language decoder to generate text responses? In this work, we strive to answer this question through designing a systematic experiment to analyze the roles of the vision encoder and the language decoder in hallucination generation. Our observations reveal that object hallucinations are predominantly associated with the strong priors from the language decoder. Based on this finding, we propose a simple and training-free framework, No-Language-Hallucination Decoding, NoLan, which refines the output distribution by dynamically suppressing language…
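To make the idea of "suppressing language priors" concrete, the following is a minimal sketch of contrastive decoding against a text-only pass, not the paper's actual formulation, which this page does not give. It assumes the model is run twice per token, once with the image (`logits_mm`) and once without it (`logits_text`). The combination rule `(1 + alpha) * mm - alpha * text`, the KL direction, and the mapping from KL to `alpha` in `dynamic_alpha` are all illustrative assumptions; the reviews below only indicate that NoLan-Plus modulates the strength per token as a function of a KL divergence.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def suppress_language_prior(logits_mm, logits_text, alpha):
    """Contrast image-conditioned logits against text-only logits.

    Assumed rule (common in contrastive decoding, not confirmed for NoLan):
    adjusted = (1 + alpha) * logits_mm - alpha * logits_text
    """
    return [(1.0 + alpha) * m - alpha * t for m, t in zip(logits_mm, logits_text)]


def dynamic_alpha(logits_mm, logits_text, alpha_max=1.0):
    """Hypothetical per-token strength derived from a KL divergence.

    The direction KL(with-image || no-image) and the bounded mapping
    kl / (1 + kl) are placeholders chosen for illustration only.
    """
    kl = kl_divergence(softmax(logits_mm), softmax(logits_text))
    return alpha_max * kl / (1.0 + kl)


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy three-token vocabulary, e.g. ["dog", "cat", "bench"] (made-up numbers).
logits_mm = [2.0, 1.0, 1.5]    # pass with the image
logits_text = [2.5, 0.5, 0.0]  # pass without the image: strong prior for token 0

fixed = suppress_language_prior(logits_mm, logits_text, alpha=1.0)
alpha_t = dynamic_alpha(logits_mm, logits_text)
adaptive = suppress_language_prior(logits_mm, logits_text, alpha_t)
```

In this toy case the image-conditioned pass alone prefers token 0, which the text-only pass also favors strongly; with `alpha=1.0` the adjusted logits prefer token 2 instead, which is the kind of shift away from a prior-driven token that the method aims for.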
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper demonstrates that language priors are one of the primary sources of hallucination errors in LVLMs, rather than deficiencies in visual capability, and substantiates this viewpoint through detailed preliminary experiments. 2. The proposed NoLan framework is elegant, pragmatic, and computationally efficient compared to prior contrastive approaches such as VCD, M3ID, and VDD. The introduction of per-token, KL-divergence-based dynamic modulation in NoLan-Plus represents a more nuanced m
1. I believe that the notion of prior language knowledge is merely one of the illusions at play. As shown in Figure 3, the authors ask image-related questions directly after removing the image, which is not a fair approach. Even humans would not be able to answer correctly without access to the image. 2. The distribution of unimodal logits is extremely uncontrollable. As shown in Figure 1, when we ask about the most common animals, the presence of a large number of words not appearing in the im
* This paper addresses the important issue of hallucination in LVLMs. * It demonstrates strong empirical performance. * The proposed methodology is simple and easy to apply.
* The analysis that LVLM hallucinations are caused by language priors is not entirely new. Prior studies have already identified that hallucinations in LVLMs stem from a strong reliance on language priors [1, 2]. * The paper lacks methodological novelty, as it merely applies minor modifications to existing, extensively studied contrastive decoding-based approaches. * Additionally, the proposed method introduces significant computational overhead at inference time, yet the paper does not include
1. The motivation of this work is reasonable, as text-prior bias is indeed an important cause of hallucinations in large vision-language models. The proposed research to address this issue is therefore meaningful and valuable. 2. The writing is generally good and easy to follow. 3. NoLan-Plus is reasonable and a useful contribution.
1. The comparative experiments are not sufficiently comprehensive. For example, on the POPE benchmark, the authors only compared their method with VCD while overlooking many more recent approaches. 2. While language-prior bias is indeed a reasonable explanation for hallucinations in LVLMs, it may not be a novel finding of this paper. Many prior studies, such as [a, b], have already thoroughly investigated this issue. 3. The proposed method uses the output from text-only inputs as the contrastive
1. The analysis comparing forward vs. reverse KL is meaningful. Since grounded visual information enters as an additional modality, validating grounding via forward KL (from the “no-image” to the “with-image” distribution) is a reasonable choice. 2. Modulating the strength of contrastive decoding as a function of the measured KL divergence is intuitive.
The paper appears highly vulnerable to the well-known drawbacks [1,2,3,4] of contrastive decoding, which can suppress the language prior, often harming text quality and logical coherence. 1. **(Very major)** The language prior embodies an LLM’s fluency and reasoning. Any method that suppresses it must quantify degradation in text quality (e.g., perplexity, LLM-as-judge, or human evaluation). Without such analysis, the approach will be hard to adopt broadly. 2. **(Very major)** When the KL div
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
