Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning

Tianyi Bai; Yuxuan Fan; Jiantao Qiu; Fupeng Sun; Jiayi Song; Junlin Han; Zichen Liu; Conghui He; Wentao Zhang; Binhang Yuan

arXiv:2506.07227 · cs.CV · June 10, 2025

Open Access

TL;DR

This paper introduces a controlled data generation pipeline and a supervised fine-tuning framework to improve fine-grained visual reasoning in multimodal large language models, reducing hallucinations and enhancing task performance.

Contribution

The paper presents the Micro Edit Dataset (MED), a collection of minimally edited image pairs, and a feature-level consistency loss that keeps visual embeddings in MLLMs stable under small edits.

Findings

01. Improved difference detection accuracy on the Micro Edit Detection benchmark.

02. Reduced hallucinations in vision-language tasks.

03. Enhanced performance on image captioning and visual question answering.

Abstract

Multimodal large language models (MLLMs) have achieved strong performance on vision-language tasks but still struggle with fine-grained visual differences, leading to hallucinations or missed semantic shifts. We attribute this to limitations in both training data and learning objectives. To address these issues, we propose a controlled data generation pipeline that produces minimally edited image pairs with semantically aligned captions. Using this pipeline, we construct the Micro Edit Dataset (MED), containing over 50K image-text pairs spanning 11 fine-grained edit categories, including attribute, count, position, and object presence changes. Building on MED, we introduce a supervised fine-tuning (SFT) framework with a feature-level consistency loss that promotes stable visual embeddings under small edits. We evaluate our approach on the Micro Edit Detection benchmark, which includes…
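
To make the idea of a feature-level consistency loss concrete, here is a minimal sketch in plain Python. The excerpt above does not give the exact form of the loss, so everything here is an assumption for illustration: the cosine-distance formulation, the function names (`cosine_similarity`, `consistency_loss`, `total_loss`), and the `weight` hyperparameter are all hypothetical and may differ from what the authors use.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (norm_u * norm_v)


def consistency_loss(
    orig_feats: Sequence[Sequence[float]],
    edit_feats: Sequence[Sequence[float]],
) -> float:
    """Mean cosine distance between paired visual features.

    ASSUMPTION: each row is one visual embedding (e.g. one patch) of the
    original image, paired with the embedding at the same position in the
    minimally edited image. Cosine distance is an illustrative choice only.
    """
    if not orig_feats or len(orig_feats) != len(edit_feats):
        raise ValueError("need the same non-zero number of features per image")
    distances = [1.0 - cosine_similarity(u, v) for u, v in zip(orig_feats, edit_feats)]
    return sum(distances) / len(distances)


def total_loss(
    sft_loss: float,
    orig_feats: Sequence[Sequence[float]],
    edit_feats: Sequence[Sequence[float]],
    weight: float = 0.1,  # hypothetical weighting, not taken from the paper
) -> float:
    """Combine a supervised fine-tuning loss with the consistency term."""
    return sft_loss + weight * consistency_loss(orig_feats, edit_feats)
```

In this sketch, identical embeddings for the original and edited image give a consistency term of zero, and the term grows as the embeddings diverge, so minimizing it alongside the SFT loss penalizes large embedding shifts under small edits. A real training setup would compute this on tensors from the vision encoder inside an autodiff framework rather than on Python lists.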

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning