Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization

Jiulong Wu; Zhengliang Shi; Shuaiqiang Wang; Jizhou Huang; Dawei Yin; Lingyong Yan; Min Cao; Min Zhang

arXiv:2506.04039 · cs.CV · September 23, 2025

TL;DR

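This paper introduces EMPO (Entity-centric Multimodal Preference Optimization), a method that reduces hallucinations in large vision-language models by strengthening image-text modality alignment and by constructing preference data automatically from open-source instruction datasets. It reports hallucination-rate reductions of 85.9% on Object-HalBench and 49.8% on MM-HalBench.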

Contribution

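The paper proposes Entity-centric Multimodal Preference Optimization (EMPO), a preference optimization method that targets image-text modality alignment, which existing human preference alignment methods neglect. To mitigate hallucinations in LVLMs despite the scarcity of high-quality multimodal preference data, it constructs that data automatically from open-source instruction datasets.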

Findings

1. EMPO reduces hallucination rates by 85.9% on Object-HalBench.
2. EMPO reduces hallucination rates by 49.8% on MM-HalBench.
3. Enhanced modality alignment improves the trustworthiness of LVLMs.

Abstract

Large Visual Language Models (LVLMs) have demonstrated impressive capabilities across multiple tasks. However, their trustworthiness is often challenged by hallucinations, which can be attributed to modality misalignment and to the inherent hallucinations of their underlying Large Language Model (LLM) backbones. Existing preference alignment methods focus on aligning model responses with human preferences while neglecting image-text modality alignment, resulting in over-reliance on LLMs and in hallucinations. In this paper, we propose Entity-centric Multimodal Preference Optimization (EMPO), which achieves enhanced modality alignment compared to existing human preference alignment methods. In addition, to overcome the scarcity of high-quality multimodal preference data, we utilize open-source instruction datasets to automatically construct high-quality preference data across three aspects:…

Taxonomy

Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Topic Modeling