CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs
Jinlan Fu, Shenzhen Huangfu, Hao Fei, Xiaoyu Shen, Bryan Hooi, Xipeng Qiu, See-Kiong Ng

TL;DR
This paper introduces CHiP, a hierarchical preference optimization method for multimodal large language models, which effectively reduces hallucinations by learning from both visual and textual preferences at multiple levels.
Contribution
We propose a novel Cross-modal Hierarchical DPO framework that incorporates visual and multi-level textual preferences to improve hallucination mitigation in MLLMs.
Findings
CHiP outperforms DPO in hallucination reduction on Object HalBench.
Relative to DPO, CHiP improves hallucination reduction by 52.7% and 55.5% with the Muffin and LLaVA base models, respectively.
The method effectively aligns image and text representations.
Abstract
Multimodal Large Language Models (MLLMs) still struggle with hallucinations despite their impressive capabilities. Recent studies have attempted to mitigate this by applying Direct Preference Optimization (DPO) to multimodal scenarios using preference pairs from text-based responses. However, our analysis of representation distributions reveals that multimodal DPO struggles to align image and text representations and to distinguish between hallucinated and non-hallucinated descriptions. To address these challenges, we propose Cross-modal Hierarchical Direct Preference Optimization (CHiP). We introduce a visual preference optimization module within the DPO framework, enabling MLLMs to learn from both textual and visual preferences simultaneously. Furthermore, we propose a hierarchical textual preference optimization module that allows the…
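For readers unfamiliar with the objective the abstract builds on, the following is a minimal sketch of the standard DPO loss for one preference pair, plus one possible way to combine a textual and a visual preference term. The combination is an assumption made for illustration, not the authors' formulation: the function name `combined_preference_loss`, the weight `lam`, and the idea of scoring the same response under an original versus a perturbed image are all hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trained policy and
    under a frozen reference model.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)


def combined_preference_loss(text_pair: tuple, visual_pair: tuple,
                             lam: float = 1.0, beta: float = 0.1) -> float:
    """Hypothetical combination of a textual and a visual preference term.

    text_pair:   log-probs for a preferred vs. hallucinated response,
                 both conditioned on the same image.
    visual_pair: log-probs for the same response conditioned on the
                 original image vs. a perturbed image (an assumption).
    Each tuple is (policy_chosen, policy_rejected, ref_chosen, ref_rejected).
    """
    return dpo_loss(*text_pair, beta=beta) + lam * dpo_loss(*visual_pair, beta=beta)


if __name__ == "__main__":
    # Made-up log-probabilities, for illustration only.
    text = (-12.0, -15.0, -13.0, -14.0)
    visual = (-12.0, -18.0, -13.0, -16.0)
    print(combined_preference_loss(text, visual, lam=0.5))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and each term equals log 2; the loss falls below that as the policy favors the preferred item more strongly than the reference does.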
Taxonomy
Topics: Natural Language Processing Techniques · Web Applications and Data Management · Speech and dialogue systems
Methods: Direct Preference Optimization · ALIGN · Balanced Selection
