Ground What You See: Hallucination-Resistant MLLMs via Caption Feedback, Diversity-Aware Sampling, and Conflict Regularization

Miao Pan; Wangjie Gan; Jintao Chen; Wenqi Zhang; Bing Sun; Jianwei Yin; Xuhong Zhang

arXiv:2601.06224 · cs.CV · January 14, 2026


TL;DR

This paper identifies three causes of hallucination that arise when multimodal large language models (MLLMs) are trained with reinforcement learning, and proposes a framework with modules for visual localization, exploration diversity, and regulation of conflicting training samples to reduce hallucinations and improve accuracy.

Contribution

The paper introduces a novel, integrated approach combining caption feedback, diversity-aware sampling, and conflict regularization to mitigate hallucinations in MLLMs during RL training.

Findings

01 Significant reduction in hallucination rates.

02 Enhanced inference accuracy of MLLMs.

03 Improved visual localization and exploration strategies.

Abstract

While Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse tasks, their practical deployment is severely hindered by hallucination issues, which become particularly acute during Reinforcement Learning (RL) optimization. This paper systematically analyzes the root causes of hallucinations in MLLMs under RL training, identifying three critical factors: (1) an over-reliance on chained visual reasoning, where inaccurate initial descriptions or redundant information anchor subsequent inferences to incorrect premises; (2) insufficient exploration diversity during policy optimization, leading the model to generate overly confident but erroneous outputs; and (3) destructive conflicts between training samples, where Neural Tangent Kernel (NTK) similarity causes false associations and unstable parameter updates. To address these challenges, we propose a…
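The third factor above refers to Neural Tangent Kernel (NTK) similarity between training samples. The abstract is truncated here and does not show the authors' formulation, so the following is only a minimal sketch of the standard empirical NTK: for two inputs, the kernel entry is the inner product of the gradients of the model output with respect to the parameters. A large normalized value suggests that a gradient update on one sample also moves the output on the other. The tiny logistic model, the function names, and the sample vectors are all hypothetical and chosen only for illustration.

```python
import math

def model_output(weights, x):
    """Scalar output of a toy logistic model: sigmoid(w . x). Illustrative only."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def param_gradient(weights, x):
    """Analytic gradient of the scalar output with respect to the weights."""
    s = model_output(weights, x)
    return [s * (1.0 - s) * xi for xi in x]

def ntk_entry(weights, x_a, x_b):
    """Empirical NTK entry: inner product of the two parameter gradients."""
    g_a = param_gradient(weights, x_a)
    g_b = param_gradient(weights, x_b)
    return sum(a * b for a, b in zip(g_a, g_b))

def ntk_cosine(weights, x_a, x_b):
    """NTK entry normalized to [-1, 1] so sample pairs are comparable."""
    k_ab = ntk_entry(weights, x_a, x_b)
    k_aa = ntk_entry(weights, x_a, x_a)
    k_bb = ntk_entry(weights, x_b, x_b)
    return k_ab / math.sqrt(k_aa * k_bb)

# Hypothetical parameters and samples.
weights = [0.5, -0.25, 1.0]
sample_a = [1.0, 2.0, 0.0]
sample_b = [2.0, 4.0, 0.0]   # same direction as sample_a
sample_c = [0.0, 0.0, 3.0]   # orthogonal to sample_a

print(ntk_cosine(weights, sample_a, sample_b))  # close to 1: updates are strongly coupled
print(ntk_cosine(weights, sample_a, sample_c))  # zero: updates do not interact
```

In this toy setting, a pair with high NTK similarity but incompatible training targets would pull the shared parameters in opposing directions, which is one way to read the "destructive conflicts" the abstract describes. For a real MLLM the gradients would come from automatic differentiation rather than a closed form.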


Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling