OViP: Online Vision-Language Preference Learning for VLM Hallucination
Shujun Liu, Siyuan Wang, Zejun Li, Jianxiang Wang, Cheng Zeng, Zhongyu Wei

TL;DR
This paper introduces OViP, an online preference-learning framework that reduces hallucination in vision-language models by building contrastive training data from the model's own hallucinated outputs and from negative images synthesized with a diffusion model.
Contribution
The paper presents an online, failure-driven training method that constructs contrastive data from the model’s own hallucinated outputs, rather than from predefined or randomly edited negative samples.
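As a rough illustration of what constructing contrastive data from the model's own outputs could look like, the sketch below picks the best and worst of several sampled responses as one preference pair. It is a minimal sketch under stated assumptions, not the authors' procedure: `build_pair`, `PreferencePair`, `score_fn` and `min_margin` are hypothetical names, and the scorer stands in for whatever judge rates faithfulness to the image.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PreferencePair:
    prompt: str
    chosen: str      # response judged more faithful to the image
    rejected: str    # response judged more hallucinated
    margin: float    # score gap between chosen and rejected


def build_pair(
    prompt: str,
    responses: List[str],
    score_fn: Callable[[str, str], float],
    min_margin: float = 0.2,
) -> Optional[PreferencePair]:
    """Turn sampled responses into one contrastive pair.

    score_fn is a hypothetical faithfulness scorer (higher = fewer
    hallucinations). Returns None when the samples are too close in
    quality to give a useful training signal.
    """
    if len(responses) < 2:
        return None
    # Sort ascending by score so the worst sample is first, the best last.
    scored = sorted(((score_fn(prompt, r), r) for r in responses),
                    key=lambda item: item[0])
    low_score, worst = scored[0]
    high_score, best = scored[-1]
    margin = high_score - low_score
    if margin < min_margin:
        return None
    return PreferencePair(prompt, best, worst, margin)
```

Because the pairs come from responses sampled during training, the rejected side reflects errors the current model actually makes, which is the point the contribution statement emphasizes.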
Findings
Significantly reduces hallucinations in vision-language models.
Maintains core multi-modal capabilities while improving training efficiency.
Refines evaluation protocols for better measurement of hallucination suppression.
Abstract
Large vision-language models (LVLMs) remain vulnerable to hallucination, often generating content misaligned with visual inputs. Although recent training-based approaches aim to mitigate hallucination, they typically rely on predefined or randomly edited negative samples that do not reflect actual model errors, thus limiting training efficacy. In this work, we propose an Online Vision-language Preference Learning (OViP) framework that dynamically constructs contrastive training data based on the model's own hallucinated outputs. By identifying semantic differences between sampled response pairs and synthesizing negative images using a diffusion model, OViP generates more relevant supervision signals in real time. This failure-driven training enables adaptive alignment of both textual and visual preferences. Moreover, we refine existing evaluation protocols to better capture the…
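The abstract does not spell out the training objective. To make "alignment of both textual and visual preferences" concrete, the sketch below assumes a DPO-style loss with two terms: one contrasting a preferred and a hallucinated response for the same image, and one contrasting the same response under the original image and a synthesized negative image. The function names, the `beta` and `image_weight` parameters, and the way the two terms are combined are all illustrative assumptions, not the paper's formulation.

```python
import math


def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_term(pol_pos: float, ref_pos: float,
             pol_neg: float, ref_neg: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one pair of sequence log-probabilities.

    pol_* are log-probabilities under the policy being trained,
    ref_* under a frozen reference model.
    """
    margin = (pol_pos - ref_pos) - (pol_neg - ref_neg)
    return -log_sigmoid(beta * margin)


def joint_loss(text_logps, image_logps,
               beta: float = 0.1, image_weight: float = 1.0) -> float:
    """Assumed combination of a text-side and an image-side preference term.

    text_logps:  (pol, ref, pol, ref) for chosen vs. rejected response,
                 both conditioned on the original image.
    image_logps: (pol, ref, pol, ref) for the chosen response conditioned
                 on the original image vs. on the negative image.
    """
    text_term = dpo_term(*text_logps, beta=beta)
    image_term = dpo_term(*image_logps, beta=beta)
    return text_term + image_weight * image_term
```

With a zero margin each term equals log 2, and the loss falls as the policy assigns relatively more probability to the preferred side than the reference model does.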
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- This paper addresses a crucial problem in VLMs: hallucinations. The paper proposes OViP, which introduces the concepts of "online construction of negative samples" and "joint image-text preference learning," demonstrating good performance on some datasets.
- The method relies on generative models (LLM evaluation, a diffusion model for generating negative images), and these steps increase its complexity. The quality of the generated negative images, and whether the synthesized negative samples truly cover the hallucination space, may limit generalization. Although the authors provide an efficiency analysis, the method's robustness in large-scale or diverse scenarios is not yet fully demonstrated.
- The core components of this method (preference learning, negative s
The experiments are solid, and the writing is clear.
1. The method's novelty is not obvious, as similar ideas have been explored before in V-DPO. 2. Fine-grained results on general benchmarks are missing and should be added. 3. The experiments are limited to the LLaVA architecture, which is relatively old. It is unclear whether the method would work on newer LVLM architectures.
1. The method builds preference data online from the model’s own mistakes, instead of relying only on fixed, offline edits. This directly targets the model’s real hallucination modes during training and can make the supervision more relevant and efficient. 2. It jointly optimizes both text faithfulness and visual grounding, aiming to reduce hallucination without making the model overly timid or uninformative.
1. The novelty needs to be clarified: similar on-policy training ideas already exist. For example, SIMA[1] also scores the model’s own generated samples and uses them as positive/negative supervision. The paper should explain more clearly what is fundamentally new beyond that prior work. 2. The approach depends on synthetic images from a generative model. But current image generators are not perfectly reliable, so the “negative” images might themselves be noisy or wrong. The paper should justi
Taxonomy
Topics: Epilepsy research and treatment
Methods: Diffusion
