Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization

Alberto Compagnoni; Davide Caffagni; Nicholas Moratelli; Lorenzo Baraldi; Marcella Cornia; Rita Cucchiara

arXiv:2508.20181·cs.CV·August 29, 2025

TL;DR

This paper introduces CHAIR-DPO, a method that reduces hallucinations in multimodal large language models by using the CHAIR metric to guide preference-based fine-tuning, improving answer accuracy without complex synthetic data pipelines.

Contribution

It proposes a novel, simple approach leveraging the CHAIR metric for preference optimization to effectively mitigate hallucinations in MLLMs.

Findings

01

Significant reduction in hallucinated answers across benchmarks.

02

Effective fine-tuning method using CHAIR-based rewards.

03

Open-source code and models available for reproducibility.

Tables

Table 1: Comparative evaluation of hallucination mitigation methods for MLLMs, conducted on the AMBER, CHAIR-MSCOCO, and Object HalBench datasets. For each method, we indicate whether the underlying MLLM is fine-tuned (FT) and whether it relies on external support from proprietary LLMs, either via additional training data or post-hoc refinement. Bold-faced and underlined values indicate the best and second-best results.

			AMBER				CHAIR-MSCOCO		Object HalBench
	FT	Ext. Support	${CHAIR}_{i}$ $↓$	Cover $↑$	HalRate $↓$	Cog $↓$	${CHAIR}_{s}$ $↓$	${CHAIR}_{i}$ $↓$	${CHAIR}_{s}$ $↓$	${CHAIR}_{i}$ $↓$
LLaVA-1.5-7B [Liu et al.(2024)Liu, Li, Li, and Lee]	-	-	7.6	51.7	35.0	4.2	50.6	13.9	47.4	25.6
+ DoLa [Chuang et al.(2024)Chuang, Xie, Luo, Kim, Glass, and He]	-	-	7.6	51.6	36.0	4.0	51.6	14.1	-	-
+ VCD [Leng et al.(2024)Leng, Zhang, Chen, Li, Lu, Miao, and Bing]	-	-	-	-	-	-	48.6	14.9	48.0	22.3
+ OPERA [Huang et al.(2024)Huang, Dong, Zhang, Wang, He, Wang, Lin, Zhang, and Yu]	-	-	7.3	49.6	32.0	3.5	47.8	14.6	-	-
+ Woodpecker [Yin et al.(2024)Yin, Fu, Zhao, Xu, Wang, Sui, Shen, Li, Sun, and Chen]	-	GPT-3.5	6.9	48.9	30.4	3.6	45.8	14.8	-	-
+ POVID [Zhou et al.(2024)Zhou, Cui, Rafailov, Finn, and Yao]	✓	GPT-4	7.4	51.3	34.3	3.9	-	-	50.7	15.3
+ HA-DPO [Zhao et al.(2023)Zhao, Wang, Ouyang, Dong, Wang, and He]	✓	GPT-4	6.7	49.8	30.9	3.3	38.2	11.0	39.9	19.9
+ HALVA [Sarkar et al.(2025)Sarkar, Ebrahimi, Etemad, Beirami, Arık, and Pfister]	✓	Gemini	6.6	53.0	32.2	3.4	41.4	11.7	-	-
+ EOS [Yue et al.(2024b)Yue, Zhang, and Jin]	✓	-	5.1	49.1	22.7	2.0	36.8	11.3	-	-
+ mDPO [Wang et al.(2024)Wang, Zhou, Huang, Xu, Zhang, Poon, and Chen]	✓	-	4.4	52.4	24.5	2.4	-	-	35.7	9.8
+ MFPO [Jiang et al.(2024)Jiang, Zhang, Chen, Jin, and Liu]	✓	-	4.1	55.7	22.5	1.9	-	-	13.4	6.6
+ REVERSE [Wu et al.(2025)Wu, Lee, Ge, Gonzalez, Darrell, and Chan]	✓	GPT-4	4.0	26.9	10.2	0.9	13.6	6.1	-	-
+ CHAIR-DPO_(β=0.5)	✓	-	3.8	48.7	18.4	1.8	20.8	5.8	20.1	10.9
+ CHAIR-DPO_(β=0.3)	✓	-	3.2	47.4	16.2	1.3	16.4	4.4	13.1	6.7
+ CHAIR-DPO_(β=0.2)	✓	-	3.0	46.6	14.7	1.3	14.4	3.6	8.6	4.6
LLaVA-MORE-8B [Cocchi et al.(2025)Cocchi, Moratelli, Caffagni, Sarto, Cornia, Baraldi, and Cucchiara]	-	-	8.1	53.2	38.4	4.0	51.2	14.4	50.5	24.7
+ DoLa [Chuang et al.(2024)Chuang, Xie, Luo, Kim, Glass, and He]	-	-	7.9	53.1	38.4	4.1	51.8	13.8	-	-
+ Woodpecker [Yin et al.(2024)Yin, Fu, Zhao, Xu, Wang, Sui, Shen, Li, Sun, and Chen]	-	GPT-3.5	7.4	50.7	36.7	3.7	51.0	14.3	-	-
+ REVERSE [Wu et al.(2025)Wu, Lee, Ge, Gonzalez, Darrell, and Chan]	✓	GPT-4	5.1	38.9	20.8	2.1	25.2	8.4	-	-
+ CHAIR-DPO_(β=0.5)	✓	-	3.4	50.7	17.1	1.2	21.0	5.2	19.7	10.0
+ CHAIR-DPO_(β=0.3)	✓	-	2.9	49.4	14.7	1.0	16.0	3.8	12.3	6.2
+ CHAIR-DPO_(β=0.2)	✓	-	2.6	49.7	14.2	1.0	11.8	3.1	9.2	4.5

Table 2: Ablation study of the application of data filtering. We assess the impact of filtering out any preference instance where the $\text{CHAIR}_{i}$ difference is zero.

		AMBER				CHAIR-MSCOCO		Object HalBench
	Data Filtering	${CHAIR}_{i}$ $↓$	Cover $↑$	HalRate $↓$	Cog $↓$	${CHAIR}_{s}$ $↓$	${CHAIR}_{i}$ $↓$	${CHAIR}_{s}$ $↓$	${CHAIR}_{i}$ $↓$
LLaVA-1.5-7B [Liu et al.(2024)Liu, Li, Li, and Lee]	-	7.6	51.7	35.0	4.2	50.6	13.9	47.4	25.6
+ CHAIR-DPO_(β=0.5)	✗	4.4	50.1	22.8	2.0	25.6	7.1	19.7	11.1
+ CHAIR-DPO_(β=0.5)	✓	3.8	48.7	18.4	1.8	20.8	5.8	20.1	10.9
+ CHAIR-DPO_(β=0.3)	✗	4.0	49.0	20.2	1.4	17.0	4.9	16.6	8.0
+ CHAIR-DPO_(β=0.3)	✓	3.2	47.4	16.2	1.3	16.4	4.4	13.1	6.7
+ CHAIR-DPO_(β=0.2)	✗	3.4	48.4	17.2	1.2	14.0	3.3	11.3	5.4
+ CHAIR-DPO_(β=0.2)	✓	3.0	46.6	14.7	1.3	14.4	3.6	8.6	4.6
LLaVA-MORE-8B [Cocchi et al.(2025)Cocchi, Moratelli, Caffagni, Sarto, Cornia, Baraldi, and Cucchiara]	-	8.1	53.2	38.4	4.0	51.2	14.4	50.5	24.7
+ CHAIR-DPO_(β=0.5)	✗	3.7	51.5	20.1	1.7	22.8	5.8	19.6	9.3
+ CHAIR-DPO_(β=0.5)	✓	3.4	50.7	17.1	1.2	21.0	5.2	19.7	10.0
+ CHAIR-DPO_(β=0.3)	✗	3.4	51.0	19.0	1.4	18.6	5.0	14.0	7.0
+ CHAIR-DPO_(β=0.3)	✓	2.9	49.4	14.7	1.0	16.0	3.8	12.3	6.2
+ CHAIR-DPO_(β=0.2)	✗	2.9	50.1	16.4	1.4	15.8	4.4	12.6	6.3
+ CHAIR-DPO_(β=0.2)	✓	2.6	49.7	14.2	1.0	11.8	3.1	9.2	4.5

Table 3: General cognition evaluation of CHAIR-DPO. We compare LLaVA-1.5-7B and LLaVA-MORE-8B with and without CHAIR-DPO to ensure performance preservation.

	MME		SEED			MMMU	Science-QA	AI2D
	Perception	Cognition	All	Video	Image	Acc	Acc	Acc
InstructBLIP-7B [Dai et al.(2023)Dai, Li, Li, Tiong, Zhao, Wang, Li, Fung, and Hoi]	-	-	53.4	38.1	58.8	-	60.5	-
Qwen-VL-7B [Bai et al.(2023)Bai, Bai, Yang, Wang, Tan, Wang, Lin, Zhou, and Zhou]	-	-	56.3	39.1	62.3	-	67.1	-
Qwen-VL-7B-Chat [Bai et al.(2023)Bai, Bai, Yang, Wang, Tan, Wang, Lin, Zhou, and Zhou]	1487.5	-	58.2	37.8	65.4	-	68.2	-
LLaVA-1.5-LLaMA3-8B [Rasheed et al.(2024a)Rasheed, Maaz, Khan, and Khan]	1544.4	330.3	64.3	42.0	70.1	37.3	74.2	60.7
LLaVA-1.5-7B [Liu et al.(2024)Liu, Li, Li, and Lee]	1474.3	314.6	61.6	42.0	66.8	34.2	69.0	56.4
+ CHAIR-DPO_(β=0.5)	1520.9	372.9	60.8	39.5	66.4	34.9	69.6	55.0
+ CHAIR-DPO_(β=0.3)	1525.9	372.9	60.8	39.7	66.3	35.0	69.3	54.8
+ CHAIR-DPO_(β=0.2)	1518.8	375.0	60.7	39.8	66.2	35.2	69.3	54.7
LLaVA-MORE-8B [Cocchi et al.(2025)Cocchi, Moratelli, Caffagni, Sarto, Cornia, Baraldi, and Cucchiara]	1531.5	353.3	64.1	42.4	69.8	39.4	76.3	61.8
+ CHAIR-DPO_(β=0.5)	1417.3	327.5	64.0	43.0	69.5	37.1	74.4	59.3
+ CHAIR-DPO_(β=0.3)	1414.0	340.7	64.1	43.5	69.5	36.1	74.4	58.9
+ CHAIR-DPO_(β=0.2)	1412.6	335.4	64.1	43.8	69.5	36.8	74.2	59.0

Equations

\text{CHAIR}_{i}(y) = \frac{[\text{hallucinated objects}]_{y}}{[\text{all mentioned objects}]_{y}} .

y_{w} = \operatorname*{arg\,min}_{y \in \{y_{1}, y_{2}\}} \text{CHAIR}_{i}(y), \quad y_{l} = \operatorname*{arg\,max}_{y \in \{y_{1}, y_{2}\}} \text{CHAIR}_{i}(y), \quad \text{with } y_{1}, y_{2} \sim \pi_{\text{ref}}(y \mid x_{T}, x_{I}) .

\mathcal{L}_{\text{DPO}}(\pi_{\theta}; \pi_{\text{ref}}) = -\mathbb{E}_{(x_{T}, x_{I}, y_{w}, y_{l}) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_{\theta}(y_{w} \mid x_{T}, x_{I})}{\pi_{\text{ref}}(y_{w} \mid x_{T}, x_{I})} - \beta \log \frac{\pi_{\theta}(y_{l} \mid x_{T}, x_{I})}{\pi_{\text{ref}}(y_{l} \mid x_{T}, x_{I})} \right) \right],

Full text

University of Modena and Reggio Emilia, Modena, Italy

Abstract

Multimodal Large Language Models (MLLMs) emerge as a unified interface to address a multitude of tasks, ranging from NLP to computer vision. Despite showcasing state-of-the-art results in many benchmarks, a long-standing issue is the tendency of MLLMs to hallucinate, that is, to generate answers to the user’s query that are not reflected in the visual input. In this paper, we address the problem of hallucinations as an alignment problem, seeking to steer the MLLM so that it prefers generating content without hallucinations. In contrast to recent approaches that require complicated pipelines to build synthetic preference data for alignment training, often relying on proprietary models, we capitalize on the well-known CHAIR metric, originally proposed to gauge the degree of hallucinations in image captioning. Given a pair of generated answers, we leverage CHAIR to distinguish winner and loser options (i.e., non-hallucinated and hallucinated samples) and fine-tune off-the-shelf MLLMs via Direct Preference Optimization (DPO). The resulting method, which we refer to as CHAIR-DPO, effectively diminishes the amount of hallucinated answers on several hallucination benchmarks, demonstrating the effectiveness of fine-tuning the MLLM with a CHAIR-based reward. Source code and trained models are publicly available at https://github.com/aimagelab/CHAIR-DPO.

1 Introduction

Research interest in Multimodal Large Language Models (MLLMs) is growing rapidly. By capitalizing on massive self-supervised pre-training on text, images [Liu et al.(2023)Liu, Li, Wu, and Lee, Liu et al.(2024)Liu, Li, Li, and Lee, Bai et al.(2023)Bai, Bai, Yang, Wang, Tan, Wang, Lin, Zhou, and Zhou, Ye et al.(2024)Ye, Xu, Ye, Yan, Hu, Liu, Qian, Zhang, and Huang, Laurençon et al.(2024)Laurençon, Marafioti, Sanh, and Tronchon], and possibly other modalities [Panagopoulou et al.(2023)Panagopoulou, Xue, Yu, Li, Li, Joty, Xu, Savarese, Xiong, and Niebles, Han et al.(2024)Han, Gong, Zhang, Wang, Zhang, Lin, Qiao, Gao, and Yue, Sun et al.(2024a)Sun, Cui, Zhang, Zhang, Yu, Wang, Rao, Liu, Huang, and Wang], they emerge as a unified interface the user can interact with to solve different problems [Caffagni et al.(2024)Caffagni, Cocchi, Barsellotti, Moratelli, Sarto, Baraldi, Baraldi, Cornia, and Cucchiara]. Not only are they accessible even to non-expert users, as the interaction happens via natural language, but they also deliver strong performance on several tasks, often challenging specialized models tailored to a single task. The capabilities of an MLLM go beyond the production of text in response to a user query, spanning from visual grounding and referring [Peng et al.(2023)Peng, Wang, Dong, Hao, Huang, Ma, and Wei, Chen et al.(2023)Chen, Zhang, Zeng, Zhang, Zhu, and Zhao, Rasheed et al.(2024b)Rasheed, Maaz, Shaji, Shaker, Khan, Cholakkal, Anwer, Xing, Yang, and Khan], up to the generation of images and videos [Sun et al.(2024b)Sun, Yu, Cui, Zhang, Zhang, Wang, Gao, Liu, Huang, and Wang, Sun et al.(2024a)Sun, Cui, Zhang, Zhang, Yu, Wang, Rao, Liu, Huang, and Wang, Team(2024)].

Despite their impressive capabilities, MLLMs still suffer from hallucinations – generating content that is unsupported by the input. Hallucination is a long-standing problem widely studied in the natural language processing community [Xu et al.(2024)Xu, Jain, and Kankanhalli, Huang et al.(2025)Huang, Yu, Ma, Zhong, Feng, Wang, Chen, Peng, Feng, Qin, et al.], but having access to complementary data modalities other than text stretches it out and opens new paths for the model to hallucinate [Sahoo et al.(2024)Sahoo, Meharia, Ghosh, Saha, Jain, and Chadha]. For instance, visual hallucinations [Rohrbach et al.(2018)Rohrbach, Hendricks, Burns, Darrell, and Saenko, Wang et al.(2023)Wang, Wang, Xu, Zhang, Gu, Jia, Wang, Xu, Yan, Zhang, et al., Huang et al.(2024)Huang, Dong, Zhang, Wang, He, Wang, Lin, Zhang, and Yu, Yue et al.(2024b)Yue, Zhang, and Jin, Yin et al.(2024)Yin, Fu, Zhao, Xu, Wang, Sui, Shen, Li, Sun, and Chen] manifest whenever an MLLM mentions an object not depicted in the input image.

An intuitive framework for reducing hallucinations is to view them as a human alignment problem: as we humans reasonably prefer answers devoid of hallucinations, so should an MLLM properly aligned to human preference. Unfortunately, established techniques employed in developing MLLMs, such as visual instruction tuning [Liu et al.(2023)Liu, Li, Wu, and Lee, Liu et al.(2024)Liu, Li, Li, and Lee], Reinforcement Learning from Human Feedback (RLHF) [Ouyang et al.(2022)Ouyang, Wu, Jiang, Almeida, Wainwright, Mishkin, Zhang, Agarwal, Slama, Ray, et al.], or Direct Preference Optimization (DPO) [Rafailov et al.(2023)Rafailov, Sharma, Mitchell, Manning, Ermon, and Finn], often prioritize that the generated answer effectively fulfills the user query, overlooking whether it contains hallucinations or not. Since they are implemented as the last training stage of MLLMs, they play the most important role in aligning the model to human preference.

Recent research [Zhao et al.(2023)Zhao, Wang, Ouyang, Dong, Wang, and He, Wang et al.(2024)Wang, Zhou, Huang, Xu, Zhang, Poon, and Chen, Jiang et al.(2024)Jiang, Zhang, Chen, Jin, and Liu, Sarkar et al.(2025)Sarkar, Ebrahimi, Etemad, Beirami, Arık, and Pfister, Wu et al.(2025)Wu, Lee, Ge, Gonzalez, Darrell, and Chan] has introduced alignment methods to steer MLLMs toward preferring non-hallucinated outputs, with a particular focus on DPO. DPO is especially appealing as it replaces the complexity of reinforcement learning with a more tractable supervised learning approach. A key challenge prior to its application, however, is to collect preference data about hallucinations. In other words, how can we know which answer generated by an MLLM is hallucinated and which is not? Resorting to human annotators is costly and does not scale, so a compelling alternative becomes to query cutting-edge proprietary MLLMs such as GPT-4 [Zhao et al.(2023)Zhao, Wang, Ouyang, Dong, Wang, and He, Zhou et al.(2024)Zhou, Cui, Rafailov, Finn, and Yao, Wu et al.(2025)Wu, Lee, Ge, Gonzalez, Darrell, and Chan] and Gemini [Sarkar et al.(2025)Sarkar, Ebrahimi, Etemad, Beirami, Arık, and Pfister] to act as a judge.

In this work, we introduce CHAIR-DPO, a new preference optimization method to tackle visual hallucinations in MLLMs. CHAIR-DPO builds on top of CHAIR [Rohrbach et al.(2018)Rohrbach, Hendricks, Burns, Darrell, and Saenko], a well-known metric proposed to assess the degree of hallucination of image captioning models [Sarto et al.(2025)Sarto, Cornia, and Cucchiara, Barraco et al.(2023)Barraco, Sarto, Cornia, Baraldi, and Cucchiara, Petryk et al.(2024)Petryk, Chan, Kachinthaya, Zou, Canny, Gonzalez, and Darrell]. The key idea is to leverage CHAIR to quantitatively measure hallucinations in the responses generated by an MLLM, and thus select as preferred the response with the lower hallucination rate. After collecting enough preference pairs, we apply DPO to fine-tune an MLLM, enhancing its awareness concerning the presence or absence of objects in the input image, and drastically reducing visual hallucinations. An overview of the proposed approach is outlined in Figure 1.

We argue that CHAIR enables more efficient preference data collection for hallucination mitigation compared to existing approaches. Experiments on multiple hallucination benchmarks, such as AMBER [Wang et al.(2023)Wang, Wang, Xu, Zhang, Gu, Jia, Wang, Xu, Yan, Zhang, et al.], CHAIR-MSCOCO [Yue et al.(2024b)Yue, Zhang, and Jin], and Object HalBench [Rohrbach et al.(2018)Rohrbach, Hendricks, Burns, Darrell, and Saenko, Yu et al.(2024)Yu, Yao, Zhang, He, Han, Cui, Hu, Liu, Zheng, Sun, et al.], show that CHAIR-DPO achieves state-of-the-art performance, significantly reducing hallucination rates without degrading the original capabilities of the underlying MLLM.

2 Related Work

Multimodal Large Language Models. Building on the advancements of Large Language Models (LLMs), interest is surging to extend LLMs to the multimodal domain [Liu et al.(2024)Liu, Li, Li, and Lee, Bai et al.(2023)Bai, Bai, Yang, Wang, Tan, Wang, Lin, Zhou, and Zhou, Sun et al.(2024b)Sun, Yu, Cui, Zhang, Zhang, Wang, Gao, Liu, Huang, and Wang, Chen et al.(2023)Chen, Zhang, Zeng, Zhang, Zhu, and Zhao, Ye et al.(2024)Ye, Xu, Ye, Yan, Hu, Liu, Qian, Zhang, and Huang, Laurençon et al.(2024)Laurençon, Marafioti, Sanh, and Tronchon, Sun et al.(2024a)Sun, Cui, Zhang, Zhang, Yu, Wang, Rao, Liu, Huang, and Wang, Rasheed et al.(2024b)Rasheed, Maaz, Shaji, Shaker, Khan, Cholakkal, Anwer, Xing, Yang, and Khan, Team(2024)], with most of the effort devoted to the integration of visual understanding. MLLMs perceive different modalities thanks to external encoders (e.g., CLIP visual encoder [Radford et al.(2021)Radford, Kim, Hallacy, Ramesh, Goh, Agarwal, Sastry, Askell, Mishkin, Clark, et al.] for images) that transform (unimodal) data into modality-specific embeddings that can be understood by the LLM upon the application of an adapter module. In this work, we utilize open-source MLLMs [Liu et al.(2024)Liu, Li, Li, and Lee, Cocchi et al.(2025)Cocchi, Moratelli, Caffagni, Sarto, Cornia, Baraldi, and Cucchiara] built on the popular LLaVA [Liu et al.(2023)Liu, Li, Wu, and Lee, Liu et al.(2024)Liu, Li, Li, and Lee] framework. These models feature a CLIP-ViT-L/14@336 [Radford et al.(2021)Radford, Kim, Hallacy, Ramesh, Goh, Agarwal, Sastry, Askell, Mishkin, Clark, et al.] as the visual encoder and a lightweight feed-forward network as the vision-to-language adapter. LLaVA initially trains only the adapter module on an image captioning dataset. Next, it also unfreezes the pre-trained LLM and jointly optimizes it on a visual instruction tuning dataset comprising multi-turn visual dialogues. The parameters of the CLIP visual encoder are kept frozen and never updated.

Direct Preference Optimization (DPO). DPO [Rafailov et al.(2023)Rafailov, Sharma, Mitchell, Manning, Ermon, and Finn] has been established as a compelling alternative to Reinforcement Learning from Human Feedback (RLHF) [Ouyang et al.(2022)Ouyang, Wu, Jiang, Almeida, Wainwright, Mishkin, Zhang, Agarwal, Slama, Ray, et al.] to align fine-tuned LLMs with human preferences. DPO presents two main advantages over RLHF. First, it saves the effort of training a reward model to mimic human preferences, and second, it completely bypasses the reinforcement learning stage, which is non-trivial to implement in practice. The intuition behind DPO is to reframe the constrained reward maximization problem of RLHF so that the objective no longer relies on an explicit reward gained by the model, but instead depends on an implicit reward expressed in terms of the optimized and reference policies. Originally born for LLMs only, DPO and its variants [Song et al.(2024)Song, Yu, Li, Yu, Huang, Li, and Wang, Hong et al.(2024)Hong, Lee, and Thorne, Wu et al.(2024)Wu, Xie, Yang, Wu, Gao, Ding, Wang, and He] are now successfully applied for aligning MLLMs as well [Jia et al.(2024)Jia, Jiang, Xu, Ye, Dong, Yan, Zhang, Huang, and Zhang, Zhang et al.(2024)Zhang, Gui, Sun, Feng, Xu, Zhang, Fu, Li, Hauptmann, Bisk, et al.], not to mention different generative domains spanning from diffusion models [Wallace et al.(2024)Wallace, Dang, Rafailov, Zhou, Lou, Purushwalkam, Ermon, Xiong, Joty, and Naik] to image captioning [Moratelli et al.(2024)Moratelli, Caffagni, Cornia, Baraldi, and Cucchiara].
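For reference, the implicit reward mentioned above can be written as follows. This is the form given in the DPO formulation of Rafailov et al., restated here with the conditioning on the textual prompt $x_{T}$ and image $x_{I}$ used later in this paper; $\pi_{\theta}$ is the policy being optimized, $\pi_{\text{ref}}$ the reference policy, $\beta$ the regularization strength, and $Z$ a partition term. Because $Z$ depends only on the inputs, it cancels whenever two answers to the same prompt are compared.

```latex
r(x_{T}, x_{I}, y) = \beta \log \frac{\pi_{\theta}(y \mid x_{T}, x_{I})}{\pi_{\text{ref}}(y \mid x_{T}, x_{I})} + \beta \log Z(x_{T}, x_{I})
```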

Hallucination Mitigation. Despite the success of MLLMs, hallucinations remain an open challenge. Recent research [Bai et al.(2024)Bai, Wang, Xiao, He, Han, Zhang, and Shou] identifies the root of hallucinations in noisy or corrupted training samples, eventually exacerbated by the addition of synthetic training data, as well as an imbalance in the frequency of certain entities appearing more often than others during optimization. This is further compounded by the overwhelming dominance of the LLM compared to the size of the visual encoder, which may lead to an over-reliance on linguistic priors rather than visual grounding. To mitigate hallucinations, prior work explores training-free and training-based approaches. Training-free methods [Chuang et al.(2024)Chuang, Xie, Luo, Kim, Glass, and He, Leng et al.(2024)Leng, Zhang, Chen, Li, Lu, Miao, and Bing, Huang et al.(2024)Huang, Dong, Zhang, Wang, He, Wang, Lin, Zhang, and Yu] modify decoding strategies or apply post-hoc corrections [Yin et al.(2024)Yin, Fu, Zhao, Xu, Wang, Sui, Shen, Li, Sun, and Chen] using external models. Training-based approaches fine-tune MLLMs via variants of preference alignment losses [Zhao et al.(2023)Zhao, Wang, Ouyang, Dong, Wang, and He, Wang et al.(2024)Wang, Zhou, Huang, Xu, Zhang, Poon, and Chen, Jiang et al.(2024)Jiang, Zhang, Chen, Jin, and Liu, Zhou et al.(2024)Zhou, Cui, Rafailov, Finn, and Yao, Sarkar et al.(2025)Sarkar, Ebrahimi, Etemad, Beirami, Arık, and Pfister] or modified cross-entropy objectives [Yue et al.(2024b)Yue, Zhang, and Jin, Wu et al.(2025)Wu, Lee, Ge, Gonzalez, Darrell, and Chan]. Notably, HA-DPO [Zhao et al.(2023)Zhao, Wang, Ouyang, Dong, Wang, and He] augments DPO with an auxiliary language modeling loss component, while HALVA [Sarkar et al.(2025)Sarkar, Ebrahimi, Etemad, Beirami, Arık, and Pfister] uses fine-grained feedback to penalize only hallucinated tokens at the phrase level. Lastly, both mDPO [Wang et al.(2024)Wang, Zhou, Huang, Xu, Zhang, Poon, and Chen] and MFPO [Jiang et al.(2024)Jiang, Zhang, Chen, Jin, and Liu] incorporate an image preference loss and anchor terms to preserve visual grounding and prevent reward degradation. Crucially, most of these approaches create or use datasets derived from complicated pipelines, often leveraging proprietary MLLMs. In contrast, our method relies solely on (i) an open-source model, the same one being fine-tuned, used exclusively for sampling completions, and (ii) an off-the-shelf object detector, drastically simplifying preference data collection.

3 Proposed Method

3.1 Problem Formulation

In our setting, an MLLM conditioned on a textual prompt $x_{T}$ and an image $x_{I}$ outputs the continuation text $y$ with probability $P(y\mid x_{T},x_{I})$. We call $y$ a hallucinated answer, and denote it as $y_{l}$, whenever $y$ mentions any object not present in $x_{I}$. Building on the intuition that humans should prefer a non-hallucinated answer $y_{w}$ over a hallucinated one, we seek to align the model with this preference, so that the probability of generating $y_{w}$ becomes higher than that of generating $y_{l}$.

3.2 Collecting Preference Data for Object Hallucinations

To align our model with human preferences regarding object hallucinations, we need a supervision signal that can distinguish between non-hallucinated and hallucinated responses. In this work, we create this signal with CHAIR [Rohrbach et al.(2018)Rohrbach, Hendricks, Burns, Darrell, and Saenko], a metric designed to assess the hallucination extent of image captioning models. Figure 2 summarizes our data collection process. Given an answer $y$ generated by an MLLM, we define its $\text{CHAIR}_{i}$ score as the fraction of hallucinated object instances mentioned in $y$:

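$$\text{CHAIR}_{i}(y)=\frac{[\text{hallucinated objects}]_{y}}{[\text{all mentioned objects}]_{y}}. \tag{1}$$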

Note that the hallucinated and mentioned objects in Eq. 1 are not treated as a set, but rather as a list. It follows that $\text{CHAIR}_{i}$ can penalize a sentence for hallucinating the same object multiple times. Normally, $\text{CHAIR}_{i}$ is applied on a supervised dataset where the objects mentioned in a ground-truth caption are known in advance. As we do not have access to similar annotations concerning hallucination, we propose to adapt publicly available datasets commonly used for visual instruction tuning of MLLMs [Liu et al.(2024)Liu, Li, Li, and Lee]. Such a dataset can be seen as a collection of triplets $\{x_{T},x_{I},y\}$, where $y$ is the ground-truth answer conditioned on the prompt $x_{T}$ and image $x_{I}$. In the case of multi-turn dialogues, we randomly truncate the conversation and provide the model with the prior turns along with the current question as textual context.
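The list-based counting described above can be illustrated with a minimal Python sketch. The tiny synonym table, the regex tokenizer, and the function names are assumptions made for illustration only; the paper relies on the MSCOCO synonym list from the CHAIR reference implementation, and the sketch returns 0 for an answer that mentions no known object, which is also an assumption.

```python
import re

# Hypothetical synonym table mapping surface words to a canonical class name.
# It stands in for the MSCOCO synonym list used by the CHAIR implementation.
SYNONYMS = {
    "dog": "dog", "puppy": "dog",
    "cat": "cat", "kitten": "cat",
    "bike": "bicycle", "bicycle": "bicycle",
    "car": "car",
}

def mentioned_objects(answer):
    """Return the canonical class of every object mention, keeping repeats."""
    words = re.findall(r"[a-z]+", answer.lower())
    return [SYNONYMS[w] for w in words if w in SYNONYMS]

def chair_i(answer, present_objects):
    """Fraction of object mentions that are not in the ground-truth set."""
    mentions = mentioned_objects(answer)
    if not mentions:
        return 0.0  # assumption: no object mention means no hallucination
    hallucinated = [m for m in mentions if m not in present_objects]
    return len(hallucinated) / len(mentions)

# Mentions are [dog, cat, cat, bicycle]; "cat" is absent and is counted twice.
score = chair_i("A puppy chases a cat, and the cat runs past a bike.",
                {"dog", "bicycle"})
```

Here `score` is 0.5 (two hallucinated mentions out of four), whereas a set-based count would give 1/3, which is the difference the list treatment is meant to capture.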

First of all, for each triplet in the dataset, we discard $y$ and sample a new pair of possible answers $\{y_{1},y_{2}\}$ from a reference instruction-tuned MLLM $\pi_{\text{ref}}$. We then apply an off-the-shelf detector trained to recognize a predefined set of objects to the image $x_{I}$, treating the class names of the detected items as the ground-truth set of objects present in $x_{I}$. At this point, it is possible to compute the $\text{CHAIR}_{i}$ score of the generated answers, considering as hallucinated any mentioned object that is not in the ground-truth set. To make the evaluation robust, we leverage a predefined set of synonyms to match the words in the generated answers with the class names of the ground-truth objects. We can now designate the non-hallucinated (i.e., winner) and hallucinated (i.e., loser) answers as:

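$$y_{w}=\operatorname*{arg\,min}_{y\in\{y_{1},y_{2}\}}\text{CHAIR}_{i}(y),\quad y_{l}=\operatorname*{arg\,max}_{y\in\{y_{1},y_{2}\}}\text{CHAIR}_{i}(y),\quad\text{with }y_{1},y_{2}\sim\pi_{\text{ref}}(y\mid x_{T},x_{I}). \tag{2}$$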
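The winner and loser assignment can be sketched in a few lines of Python. The function name and the example answers and scores are hypothetical; the scores are assumed to have been computed beforehand with $\text{CHAIR}_{i}$.

```python
def select_pair(answers, scores):
    """Return (winner, loser): the answers with the lowest and highest score.

    `answers` and `scores` are parallel lists. With two candidates this is the
    arg min / arg max over the pair. On a tie the labels are arbitrary, since
    sorting simply preserves the input order.
    """
    ranked = sorted(zip(scores, answers), key=lambda pair: pair[0])
    winner = ranked[0][1]
    loser = ranked[-1][1]
    return winner, loser

winner, loser = select_pair(
    ["A dog and a cat on a sofa.", "A dog on a sofa."],
    [0.5, 0.0],
)
```

In this example `winner` is the answer that mentions only the dog. Tied pairs receive an arbitrary label here, which is why such pairs need separate handling before training.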

3.3 Object-Aware Preference Optimization

With our preference data established, we apply Direct Preference Optimization (DPO) [Rafailov et al.(2023)Rafailov, Sharma, Mitchell, Manning, Ermon, and Finn], an effective training approach for the human alignment of MLLMs. DPO bypasses the need for reinforcement learning to explicitly maximize a reward signal that scores the completions generated by the model. Instead, DPO simultaneously trains the policy model (i.e., the chosen MLLM) along with an implicit reward model that assigns a higher score to a winner answer $y_{w}$ compared to the loser answer $y_{l}$:

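$$\mathcal{L}_{\text{DPO}}(\pi_{\theta};\pi_{\text{ref}})=-\mathbb{E}_{(x_{T},x_{I},y_{w},y_{l})\sim\mathcal{D}}\left[\log\sigma\left(\beta\log\frac{\pi_{\theta}(y_{w}\mid x_{T},x_{I})}{\pi_{\text{ref}}(y_{w}\mid x_{T},x_{I})}-\beta\log\frac{\pi_{\theta}(y_{l}\mid x_{T},x_{I})}{\pi_{\text{ref}}(y_{l}\mid x_{T},x_{I})}\right)\right], \tag{3}$$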

where $\beta$ is a regularizing hyperparameter controlling the strength of the Kullback-Leibler divergence, expressing the degree to which the policy model $\pi_{\theta}$ must be tied to a frozen reference model $\pi_{\text{ref}}$. In practice, $\pi_{\theta}$ is initialized with the same weights as $\pi_{\text{ref}}$. Note that our method may be applied iteratively: after exhausting the dataset once, fresh preference data can be built following Sec. 3.2, upon updating $\pi_{\text{ref}}$ with $\pi_{\theta}$.

Minimizing Eq. 3 pulls up the probability of generating the non-hallucinated completion $\pi_{\theta}(y_{w}\mid x_{T},x_{I})$ to the detriment of the hallucinated one $\pi_{\theta}(y_{l}\mid x_{T},x_{I})$. Succeeding in that has the immediate consequence of making $\pi_{\theta}$ aware of which objects really appear in the presented image, and which are missing. We refer to the proposed method as CHAIR-DPO.
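To make the behaviour of the objective concrete, the per-pair term inside the expectation of the DPO loss can be sketched in plain Python. This is only an illustration, not the authors' training code: the function names are hypothetical, and each argument is assumed to be the log-probability of a whole answer (summed over its tokens) under the policy or the frozen reference model.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """DPO loss for one (winner, loser) pair.

    logp_*     : log-probability of the answer under the policy
    ref_logp_* : log-probability of the answer under the reference model
    beta       : regularization strength
    """
    winner_ratio = logp_w - ref_logp_w  # log(pi_theta / pi_ref) for the winner
    loser_ratio = logp_l - ref_logp_l   # log(pi_theta / pi_ref) for the loser
    return -log_sigmoid(beta * (winner_ratio - loser_ratio))

# Policy equal to the reference: the margin is zero and the loss is log(2).
initial = dpo_loss(-10.0, -10.0, -10.0, -10.0, beta=0.2)
# Policy that raised the winner and lowered the loser: the loss decreases.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0, beta=0.2)
```

At initialization the policy equals the reference, so every pair starts from a loss of $\log 2$; the loss falls only as the policy assigns relatively more probability to the winner than the reference does.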

Data Filtering for Reliable Preference Supervision. A critical challenge we observe during training stems from the presence of several completion pairs with indistinguishable $\text{CHAIR}_{i}$ scores, making it impossible to assign winner and loser labels. Incorporating such pairs into the optimization process introduces noisy supervision, as the model is forced to learn from randomly chosen preference labels. To address this, we introduce a simple yet effective filtering strategy: we discard all training instances where the difference between the $\text{CHAIR}_{i}$ scores of the two completions is zero. This ensures that the retained supervision pairs reflect a meaningful distinction in object hallucination severity, leading to a more reliable optimization signal.
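The filtering rule amounts to a single comparison per pair, as in the following sketch. The record layout and the field names `chair_1` and `chair_2` are hypothetical and chosen only for illustration.

```python
def filter_tied_pairs(records):
    """Keep only the records whose two candidates have different CHAIR_i scores."""
    return [r for r in records if r["chair_1"] != r["chair_2"]]

records = [
    {"id": 0, "chair_1": 0.0, "chair_2": 0.5},
    {"id": 1, "chair_1": 0.0, "chair_2": 0.0},    # both hallucination-free: tie
    {"id": 2, "chair_1": 0.25, "chair_2": 0.25},  # equally hallucinated: tie
]
kept = filter_tied_pairs(records)
```

Only the first record survives. Note that a tie covers both the case where neither answer hallucinates and the case where both hallucinate to the same degree.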

4 Experiments

4.1 Experimental Setting

Implementation and Training Details. We build hallucination preference data starting from LLaVA-Instruct-665k [Liu et al.(2024)Liu, Li, Li, and Lee], an open-source and widely employed dataset for visual instruction tuning, comprising multi-turn dialogues. We ask our chosen MLLMs, namely LLaVA-1.5-7B [Liu et al.(2024)Liu, Li, Li, and Lee] and LLaVA-MORE-8B [Cocchi et al.(2025)Cocchi, Moratelli, Caffagni, Sarto, Cornia, Baraldi, and Cucchiara], to generate the candidate completions $\{y_{1},y_{2}\}$ conditioned on the image and the first $k$ human-assistant turns, where $k$ is randomly chosen. (Specifically, LLaVA-1.5-7B is based on Vicuna-7B [Chiang et al.(2023)Chiang, Li, Lin, Sheng, Wu, Zhang, Zheng, Zhuang, Zhuang, Gonzalez, Stoica, and Xing], while LLaVA-MORE-8B is based on LLaMA-3.1-8B [Dubey et al.(2024)Dubey, Jauhri, Pandey, Kadian, Al-Dahle, Letman, Mathur, Schelten, Yang, Fan, et al.]. Both are trained following the two-stage training pipeline introduced in [Liu et al.(2024)Liu, Li, Li, and Lee].) Next, we identify winner and loser candidates by computing the $\text{CHAIR}_{i}$ score, using as ground-truth objects those detected by DETR-DC5-R101 [Carion et al.(2020)Carion, Massa, Synnaeve, Usunier, Kirillov, and Zagoruyko]. We employ class names from MSCOCO [Lin et al.(2014)Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár, and Zitnick] using the list of synonyms provided in the reference implementation of CHAIR [Rohrbach et al.(2018)Rohrbach, Hendricks, Burns, Darrell, and Saenko]. After filtering out all the instances with null $\text{CHAIR}_{i}$ difference, we end up with 70k and 77k training samples for LLaVA-1.5-7B and LLaVA-MORE-8B, respectively. We refer to Appendix A.1 for more details on the DPO fine-tuning stage, carried out with LoRA [Hu et al.(2021)Hu, Shen, Wallis, Allen-Zhu, Li, Wang, Wang, and Chen] for efficiency and performance preservation.
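The random truncation of multi-turn dialogues can be sketched as below. The data layout (a non-empty list of question and answer tuples), the function name, and the uniform choice of the cut point are assumptions for illustration; the paper only states that the number of retained turns is randomly chosen.

```python
import random

def truncate_dialogue(turns, rng):
    """Cut a multi-turn dialogue at a random turn.

    `turns` is a non-empty list of (question, answer) tuples. Returns the
    prior turns kept as context and the question the model must answer.
    The original answer to that question is discarded.
    """
    k = rng.randrange(len(turns))  # number of full turns kept: 0 .. len-1
    context = turns[:k]
    question = turns[k][0]
    return context, question

dialogue = [
    ("What is in the image?", "A dog lying on a sofa."),
    ("What color is the sofa?", "It is grey."),
    ("Is anyone else in the room?", "No, only the dog."),
]
context, question = truncate_dialogue(dialogue, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which is convenient when the same truncation must be reused to generate both candidate completions.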

Datasets and Evaluation Benchmarks. We evaluate the performance of CHAIR-DPO on three popular hallucination-oriented benchmarks: AMBER [Wang et al.(2023)Wang, Wang, Xu, Zhang, Gu, Jia, Wang, Xu, Yan, Zhang, et al.], CHAIR-MSCOCO [Yue et al.(2024b)Yue, Zhang, and Jin], and Object HalBench [Rohrbach et al.(2018)Rohrbach, Hendricks, Burns, Darrell, and Saenko, Yu et al.(2024)Yu, Yao, Zhang, He, Han, Cui, Hu, Liu, Zheng, Sun, et al.]. AMBER is designed to assess the visual hallucination tendencies of MLLMs without relying on external LLMs for response annotation or judgment. Following recent works [Jiang et al.(2024)Jiang, Zhang, Chen, Jin, and Liu, Wang et al.(2024)Wang, Zhou, Huang, Xu, Zhang, Poon, and Chen], we focus on the generative task, evaluating hallucination proneness using the $\text{CHAIR}_{i}$, Hallucination Rate (HalRate), and Cognition (Cog) metrics, and assessing object recall via the Coverage (Cover) metric. The $\text{CHAIR}_{i}$ score is computed as described in Eq. 1, while HalRate corresponds to the $\text{CHAIR}_{s}$ metric [Rohrbach et al.(2018)Rohrbach, Hendricks, Burns, Darrell, and Saenko], defined as the percentage of responses that contain at least one hallucinated object. CHAIR-MSCOCO builds on similar principles but operates on a subset of the MSCOCO dataset [Lin et al.(2014)Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár, and Zitnick]. It evaluates hallucinations using $\text{CHAIR}_{i}$ and $\text{CHAIR}_{s}$, leveraging the predefined MSCOCO object classes as a reference vocabulary and extracting mentioned objects with a standard NLP toolkit, in line with the official CHAIR implementation [Rohrbach et al.(2018)Rohrbach, Hendricks, Burns, Darrell, and Saenko]. Object HalBench also reports $\text{CHAIR}_{i}$ and $\text{CHAIR}_{s}$ scores but uses a different subset of the MSCOCO validation set. Unlike CHAIR-MSCOCO, it employs a proprietary LLM (i.e., GPT-3.5) to extract mentioned objects, which improves the precision and recall of object identification. Further details about datasets and metrics are provided in Appendix A.2.
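To make the response-level metric concrete, the sketch below computes a rate in the sense of HalRate / $\text{CHAIR}_{s}$, i.e. the percentage of responses containing at least one hallucinated object. It assumes that mentioned and ground-truth objects are already available as collections of class names. The function name `hal_rate` and the toy data are hypothetical and do not come from any benchmark.

```python
from typing import Iterable, List, Set, Tuple


def hal_rate(samples: List[Tuple[Iterable[str], Set[str]]]) -> float:
    """Percentage of responses with at least one mentioned object not in the ground truth."""
    if not samples:
        return 0.0
    flagged = sum(1 for mentioned, gt in samples if set(mentioned) - gt)
    return 100.0 * flagged / len(samples)


# Each entry: (objects mentioned in the response, ground-truth objects of the image).
samples = [
    (["person", "bus"], {"person", "bus"}),    # no hallucination
    (["person", "bench"], {"person", "bus"}),  # "bench" is hallucinated
    (["dog", "cat", "chair"], {"dog"}),        # two hallucinated objects, response counted once
    ([], {"car"}),                             # nothing mentioned, nothing hallucinated
]
print(hal_rate(samples))  # 2 of the 4 responses are flagged
```

Unlike $\text{CHAIR}_{i}$, a response is counted once regardless of how many hallucinated objects it contains.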

4.2 Experimental Results

Comparison with the State of the Art. In Table 1, we compare our fine-tuned models against numerous recent approaches to alleviating hallucinations in MLLMs. Many of them resort to proprietary MLLMs to collect preference data for fine-tuning [Zhao et al.(2023)Zhao, Wang, Ouyang, Dong, Wang, and He, Sarkar et al.(2025)Sarkar, Ebrahimi, Etemad, Beirami, Arık, and Pfister, Wu et al.(2025)Wu, Lee, Ge, Gonzalez, Darrell, and Chan] or design sophisticated data pipelines to craft preferred and dispreferred completions [Jiang et al.(2024)Jiang, Zhang, Chen, Jin, and Liu]. Conversely, CHAIR-DPO only requires the offline application of an off-the-shelf detector and the computation of $\text{CHAIR}_{i}$, a metric that is widely known in the captioning literature and requires no additional engineering.

We notice that both LLaVA-1.5-7B and LLaVA-MORE-8B combined with CHAIR-DPO always achieve the best results, or the second-best at worst, on metrics directly assessing the hallucination degree. Moreover, we can slightly adapt the behavior of CHAIR-DPO by varying the $\beta$ hyperparameter, which controls the strength of the Kullback-Leibler regularization in DPO (cf. Eq. 3). Specifically, lower $\beta$ values soften this regularization, achieving lower hallucination rates at the expense of slightly worse Coverage scores on AMBER. This pattern of trading off Coverage for lower hallucination is consistent across all the considered models. However, CHAIR-DPO manages to strike a good balance between the two metrics, in contrast to other methods. For instance, REVERSE reaches a lower HalRate than $\text{CHAIR-DPO}_{\beta=0.2}$ with LLaVA-1.5-7B on AMBER (10.2 vs. 14.7), but its Coverage falls to 26.9, a severe drop of 24.8 points with respect to the 51.7 points of the baseline model, while for $\text{CHAIR-DPO}_{\beta=0.2}$ the degradation is limited to 5.1 points.
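The role of $\beta$ can be illustrated with the standard per-sample DPO objective, written here in plain Python. This is a sketch under the assumption that Eq. 3 follows the standard DPO form, in which $\beta$ multiplies the difference between the winner and loser log-ratios and, in the derivation of DPO, is the coefficient of the KL penalty toward $\pi_{\text{ref}}$. The function name `dpo_loss` and the toy log-probabilities are hypothetical.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Per-sample DPO loss: -log sigmoid(beta * (winner log-ratio - loser log-ratio))."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    z = beta * margin
    # Numerically stable form of -log(sigmoid(z)), i.e. softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: the margin is 0 and the loss equals log(2) for any beta.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0, beta=0.2))

# Same policy shift evaluated with the two beta values discussed in the text.
for beta in (0.2, 0.5):
    print(beta, dpo_loss(-9.0, -13.0, -10.0, -12.0, beta=beta))
```

In practice the log-probabilities would be sums of token log-probabilities of each completion under the fine-tuned and the frozen reference model.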

The comparison on LLaVA-MORE-8B is even more favorable to CHAIR-DPO, which achieves the best results on the hallucination metrics. Training-free methods that do not modify the weights of LLaVA-MORE-8B, such as DoLa [Chuang et al.(2024)Chuang, Xie, Luo, Kim, Glass, and He] and Woodpecker [Yin et al.(2024)Yin, Fu, Zhao, Xu, Wang, Sui, Shen, Li, Sun, and Chen], are superior in terms of Coverage, but they fall well short of CHAIR-DPO in terms of hallucination metrics.

Ablation Studies. We assess the impact of filtering out instances where the $\text{CHAIR}_{i}$ difference between the two candidate completions is null, presenting the results in Table 2. With data filtering enabled, CHAIR-DPO records the best hallucination scores in terms of $\text{CHAIR}_{i}$ and HalRate on AMBER. The same effect is observable on the Cognition metric in AMBER and in general on CHAIR-MSCOCO, except for LLaVA-1.5-7B with $\beta=0.2$, which performs slightly better without the filter. With LLaVA-MORE-8B, the data filtering benefits are consistent on AMBER and CHAIR-MSCOCO, while we notice a minimal degradation on Object HalBench when $\beta$ is set to 0.5. We attribute this to the strong regularization imposed by $\beta=0.5$ during fine-tuning, which limits how much the model can learn from severe changes in the hallucination level measured by $\text{CHAIR}_{i}$. Finally, we notice the same pattern for the Coverage metric on AMBER as in Table 1, where models trade off better hallucination rates for lower Coverage. These results suggest that the proposed data filtering approach is well-suited for CHAIR-DPO, with the added benefit of significantly speeding up fine-tuning by eliminating nearly 90% of the dataset.

Performance Preservation Analysis. An important research question is whether CHAIR-DPO hurts the general cognitive capabilities of the reference MLLM, i.e., $\pi_{\text{ref}}$ (cf. Eq. 3). To address this point, we test CHAIR-DPO on several benchmarks typically employed in a comprehensive evaluation of MLLMs [Liu et al.(2024)Liu, Li, Li, and Lee]. These benchmarks include MME [Fu et al.(2023)Fu, Chen, Shen, Qin, Zhang, Lin, Yang, Zheng, Li, Sun, et al.], which measures capabilities in 14 distinct multimodal interaction categories, SEED-Bench [Li et al.(2023)Li, Wang, Wang, Ge, Ge, and Shan], which examines 12 cognitive aspects spanning from visual comprehension to text recognition via 19k expert-developed questions, MMMU [Yue et al.(2024a)Yue, Ni, Zhang, Zheng, Liu, Zhang, Stevens, Jiang, Ren, Sun, et al.], which challenges models with specialized academic content drawn from collegiate curricula, Science-QA [Lu et al.(2022)Lu, Mishra, Xia, Qiu, Chang, Zhu, Tafjord, Clark, and Kalyan], which probes multidisciplinary reasoning across scientific domains with structured question formats, and AI2D [Kembhavi et al.(2016)Kembhavi, Salvato, Kolve, Seo, Hajishirzi, and Farhadi], which assesses visual-scientific literacy through diagram interpretation exercises. The results are presented in Table 3. We highlight that CHAIR-DPO even improves LLaVA-1.5-7B on MME, MMMU, and Science-QA, while scoring comparably to the reference model and other open-source MLLMs on SEED and AI2D. Notably, this behavior is consistent regardless of the strength of the $\beta$ regularizer. On the other hand, LLaVA-MORE-8B incurs a modest penalty on most benchmarks, even though CHAIR-DPO enhances its performance on SEED-Video. We argue that this regression on general benchmarks is more than acceptable given the notable reduction in hallucinations shown in Table 1. Overall, we conclude that CHAIR-DPO does not cause catastrophic forgetting of the knowledge acquired by LLaVA models during visual instruction tuning. We credit this to the Kullback-Leibler regularization of DPO, controlled by $\beta$, as well as to the efficient fine-tuning enabled by LoRA, which results in a minimal change to the parameters of the baseline model.

Qualitative Results. Finally, Figure 3 presents captions generated by LLaVA-1.5-7B and LLaVA-MORE-8B [Cocchi et al.(2025)Cocchi, Moratelli, Caffagni, Sarto, Cornia, Baraldi, and Cucchiara] conditioned on the prompt "Describe the image", before and after fine-tuning with CHAIR-DPO. As can be seen, LLaVA-1.5-7B [Liu et al.(2023)Liu, Li, Wu, and Lee] generates a reasonable description of the first picture, but then mentions the erroneous presence of two other people and a chair in the same room. CHAIR-DPO effectively avoids such hallucinations, and instead adds the detail of the glasses worn by the woman. A similar pattern is repeated with LLaVA-MORE-8B, which hallucinates a total of four alleged people, as well as a pair of benches. Conversely, CHAIR-DPO correctly identifies that no vehicles or pedestrians appear in the image other than the double-decker bus, further confirming the effectiveness of the proposed method in mitigating hallucinations. Additional qualitative results of both models are shown in Appendix B.

5 Conclusion

In this work, we introduced CHAIR-DPO, a novel preference optimization method for mitigating hallucinations in Multimodal Large Language Models. By leveraging the well-established CHAIR metric to build preference data for DPO training, our approach achieves state-of-the-art performance across multiple hallucination benchmarks while requiring only an off-the-shelf object detector in the data collection stage. Unlike existing approaches that rely on complex pipelines or proprietary MLLMs to generate preference data, CHAIR-DPO provides a simpler yet effective alignment framework. Our experiments demonstrate that CHAIR-DPO not only reduces hallucinations significantly but also preserves the general capabilities of the baseline models, with minimal regression on standard benchmarks. The effectiveness of our method suggests that object awareness is a critical component in developing MLLMs that produce factually accurate responses grounded in visual inputs.

Acknowledgments

We acknowledge the CINECA award under the ISCRA initiative, for the availability of high-performance computing resources. This work has been supported by the EU Horizon projects “ELIAS” (GA No. 101120237) and “ELLIOT” (GA No. 101214398), by the EuroHPC JU project “MINERVA” (GA No. 101182737), and by the PRIN project “MUSMA” (CUP G53D23002930006 - M4C2 I1.1), funded by the EU - NextGenerationEU.

Appendix A Experimental Settings

A.1 Additional Implementation Details

To collect our preference data, we sample responses from both models using a temperature of 0.7. For both the full and filtered preference datasets, we hold out 500 samples to serve as a validation set. The final checkpoint is selected based on the lowest $\text{CHAIR}_{i}$ score observed on this validation set, where the score is computed using micro-averaging, that is, by aggregating all hallucinated and mentioned objects across the dataset before applying Eq. 1. For training, we employ LoRA with a rank of 128 and an $\alpha$ of 256. We use the Adam [Kingma(2014)] optimizer, along with a cosine learning rate scheduler. The peak learning rate is set to $2\times 10^{-6}$, with 33 warmup steps for LLaVA-1.5-7B and 145 warmup steps for LLaVA-MORE-8B. Training is conducted in a distributed, multi-node, multi-GPU environment consisting of 2 nodes with 8 NVIDIA A100 64GB GPUs each. We utilize DeepSpeed ZeRO Stage 2 [Rajbhandari et al.(2020)Rajbhandari, Rasley, Ruwase, and He] and gradient checkpointing to optimize memory usage. This setup allows us to use a total batch size of 64 for LLaVA-1.5-7B and 16 for LLaVA-MORE-8B.
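The learning rate schedule described above can be sketched as a warmup phase followed by cosine decay. The peak learning rate and the warmup length are taken from the text. The linear shape of the warmup, the decay to zero, the function name `lr_at_step`, and the total number of steps used in the example are assumptions made only for illustration.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-6, warmup_steps=33):
    """Learning rate at a 0-indexed optimizer step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        # assumption: linear ramp that reaches peak_lr on the last warmup step
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


# total_steps=1000 is a hypothetical value, not a setting reported in the paper.
for step in (0, 32, 33, 500, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```

Passing `warmup_steps=145` gives the corresponding schedule for the LLaVA-MORE-8B setting.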

A.2 Evaluation Protocol Details

AMBER. The AMBER evaluation set comprises 1,004 manually annotated images, each labeled across four dimensions: object existence (visibility), attributes (properties of visible objects), relations (direct contact between objects), and hallucination targets (objects likely to be imagined based on context). Beyond standard metrics like $\text{CHAIR}_{i}$ and Hallucination Rate, AMBER includes Coverage, which measures object recall as the ratio of mentioned objects to ground truth objects. It also introduces Cognition, a metric designed to assess whether hallucinations produced by MLLMs align with patterns of human cognition. This is computed as the proportion of hallucinated objects that match the predefined hallucinatory targets. All AMBER metrics are macro-averaged across the dataset by computing the score independently for each sample and then averaging the per-sample results. For example, $\text{CHAIR}_{i}$ is computed using Eq. 1 on each image, and the final score is obtained by taking the mean of these values.
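As an illustration of the macro-averaged protocol, the sketch below computes Coverage per image and then averages the per-image values. It follows the description above and is not the official AMBER code. The names `coverage` and `macro_average`, the toy annotations, and the choice to score images without annotated objects as 0 are assumptions.

```python
from typing import Iterable, List, Set


def coverage(mentioned: Iterable[str], gt: Set[str]) -> float:
    """Share of ground-truth objects that the response mentions."""
    if not gt:
        return 0.0  # assumption: images without annotated objects score 0
    return len(set(mentioned) & gt) / len(gt)


def macro_average(values: List[float]) -> float:
    """Mean of per-sample scores."""
    return sum(values) / len(values) if values else 0.0


# Hypothetical responses and annotations for two images.
per_image = [
    coverage(["person", "bus"], {"person", "bus", "tree", "road"}),  # 2 of 4 objects covered
    coverage(["dog", "frisbee"], {"dog"}),                           # 1 of 1 object covered
]
print(per_image, macro_average(per_image))
```

Note that the hallucinated "frisbee" in the second response does not lower Coverage; it is penalized by $\text{CHAIR}_{i}$ and Hallucination Rate instead.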

CHAIR-MSCOCO and Object HalBench. While the original CHAIR benchmark reports results on the Karpathy test split [Karpathy and Fei-Fei(2015)] and a robust test set [Lu et al.(2018)Lu, Yang, Batra, and Parikh], CHAIR-MSCOCO evaluates model-generated descriptions for 500 images randomly sampled from the MSCOCO validation set. Following the original CHAIR implementation [Rohrbach et al.(2018)Rohrbach, Hendricks, Burns, Darrell, and Saenko], ground-truth object annotations come from COCO ground-truth sentences, while mentioned objects are extracted with an NLP toolkit and preprocessed to account for plural forms, synonyms, and two-word compounds, ensuring robust evaluation. By contrast, Object HalBench employs a different subsample of the MSCOCO validation set, consisting of 300 images. Differently from AMBER, $\text{CHAIR}_{i}$ and $\text{CHAIR}_{s}$ scores are aggregated via micro-averaging.
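The difference between the two aggregation schemes can be seen on a toy example. The sketch below takes, for each response, the number of hallucinated objects and the number of mentioned objects, and computes $\text{CHAIR}_{i}$ both ways. The function names and the counts are hypothetical, and skipping responses that mention no objects in the macro average is an assumption.

```python
from typing import List, Tuple

# Each entry: (number of hallucinated objects, number of mentioned objects) for one response.
Counts = Tuple[int, int]


def chair_i_macro(counts: List[Counts]) -> float:
    """Mean of per-response ratios, as in the AMBER protocol."""
    ratios = [h / m for h, m in counts if m > 0]
    return sum(ratios) / len(ratios) if ratios else 0.0


def chair_i_micro(counts: List[Counts]) -> float:
    """Ratio of pooled counts, as in CHAIR-MSCOCO and Object HalBench."""
    total_h = sum(h for h, _ in counts)
    total_m = sum(m for _, m in counts)
    return total_h / total_m if total_m else 0.0


counts = [(1, 2), (0, 8)]  # a short response with one error, a long response with none
print(chair_i_macro(counts))  # (1/2 + 0/8) / 2
print(chair_i_micro(counts))  # (1 + 0) / (2 + 8)
```

Macro-averaging weights every response equally, whereas micro-averaging weights every mentioned object equally, so long error-free responses pull the micro score down more strongly.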

Appendix B Qualitative Results

We show in Figure 4 and Figure 5 the efficacy of CHAIR-DPO with LLaVA-1.5-7B and LLaVA-MORE-8B, respectively. All image descriptions have been generated by conditioning the models on the prompt "Describe the image". Not only does CHAIR-DPO reduce visual hallucinations, denoted by the red font, but it also encourages the model to concentrate on additional fine-grained details compared to the baseline, which we highlight in blue font. The latter finding especially holds for LLaVA-MORE-8B.

Bibliography

1. [Bai et al.(2023)Bai, Bai, Yang, Wang, Tan, Wang, Lin, Zhou, and Zhou] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv preprint arXiv:2308.12966, 2023.
2. [Bai et al.(2024)Bai, Wang, Xiao, He, Han, Zhang, and Shou] Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of Multimodal Large Language Models: A Survey. arXiv preprint arXiv:2404.18930, 2024.
3. [Barraco et al.(2023)Barraco, Sarto, Cornia, Baraldi, and Cucchiara] Manuele Barraco, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning. In ICCV, 2023.
4. [Caffagni et al.(2024)Caffagni, Cocchi, Barsellotti, Moratelli, Sarto, Baraldi, Baraldi, Cornia, and Cucchiara] Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Lorenzo Baraldi, Marcella Cornia, and Rita Cucchiara. The Revolution of Multimodal Large Language Models: A Survey. In ACL Findings, 2024.
5. [Carion et al.(2020)Carion, Massa, Synnaeve, Usunier, Kirillov, and Zagoruyko] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-End Object Detection with Transformers. In ECCV, 2020.
6. [Chen et al.(2023)Chen, Zhang, Zeng, Zhang, Zhu, and Zhao] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic. arXiv preprint arXiv:2306.15195, 2023.
7. [Chiang et al.(2023)Chiang, Li, Lin, Sheng, Wu, Zhang, Zheng, Zhuang, Zhuang, Gonzalez, Stoica, and Xing] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality, 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
8. [Chuang et al.(2024)Chuang, Xie, Luo, Kim, Glass, and He] Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models. In ICLR, 2024.