CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs

Jinlan Fu; Shenzhen Huangfu; Hao Fei; Xiaoyu Shen; Bryan Hooi; Xipeng Qiu; See-Kiong Ng

arXiv:2501.16629 · cs.CL · January 29, 2025

TL;DR

This paper introduces CHiP, a hierarchical preference optimization method for multimodal large language models, which effectively reduces hallucinations by learning from both visual and textual preferences at multiple levels.

Contribution

We propose a novel Cross-modal Hierarchical DPO framework that incorporates visual and multi-level textual preferences to improve hallucination mitigation in MLLMs.

Findings

01

CHiP outperforms DPO in hallucination reduction on Object HalBench.

02

On Object HalBench, CHiP achieves relative improvements of 52.7% and 55.5% over DPO with Muffin and LLaVA as base models, respectively.

03

The method effectively aligns image and text representations.

Abstract

Multimodal Large Language Models (MLLMs) still struggle with hallucinations despite their impressive capabilities. Recent studies have attempted to mitigate this by applying Direct Preference Optimization (DPO) to multimodal scenarios using preference pairs from text-based responses. However, our analysis of representation distributions reveals that multimodal DPO struggles to align image and text representations and to distinguish between hallucinated and non-hallucinated descriptions. To address these limitations, we propose Cross-modal Hierarchical Direct Preference Optimization (CHiP). We introduce a visual preference optimization module within the DPO framework, enabling MLLMs to learn from both textual and visual preferences simultaneously. Furthermore, we propose a hierarchical textual preference optimization module that allows the…
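
To make the idea of learning from textual and visual preferences at once more concrete, here is a minimal sketch in plain Python. The `dpo_loss` function is the standard DPO objective for a single preference pair. Everything else is an assumption made for illustration and is not taken from the paper: the `combined_loss` helper, the `visual_weight` parameter, and the reading of a "visual preference" as the same response scored against a preferred image versus a dispreferred one. The hierarchical textual module is not sketched, since the abstract above is cut off before describing it.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Arguments are sequence log-probabilities under the policy being trained
    and under the frozen reference model.
    """
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    return -log_sigmoid(chosen_reward - rejected_reward)


def combined_loss(text_pair, visual_pair, visual_weight=1.0, beta=0.1):
    """Hypothetical combination (an assumption, not the paper's formula).

    text_pair:   log-probs for chosen vs. rejected responses, same image.
    visual_pair: log-probs for one response given a preferred vs. a
                 dispreferred image.
    Each pair is (policy_chosen, policy_rejected, ref_chosen, ref_rejected).
    """
    text_term = dpo_loss(*text_pair, beta=beta)
    visual_term = dpo_loss(*visual_pair, beta=beta)
    return text_term + visual_weight * visual_term


if __name__ == "__main__":
    # Illustrative log-probabilities only; these are not results from the paper.
    text_pair = (-12.0, -15.0, -13.0, -14.0)
    visual_pair = (-12.0, -18.0, -13.0, -16.0)
    print(combined_loss(text_pair, visual_pair, visual_weight=0.5))
```

When the policy and reference assign identical log-probabilities, each term equals log 2, and it drops below that as the policy favors the preferred side more strongly than the reference does. Setting `visual_weight` to zero recovers text-only DPO.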

Code & Models

Repositories

lvugai/chip
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Web Applications and Data Management · Speech and dialogue systems

Methods: Direct Preference Optimization · ALIGN · Balanced Selection