Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs

Xudong Li; Mengdan Zhang; Peixian Chen; Xiawu Zheng; Yan Zhang; Jingyuan Zheng; Yunhang Shen; Ke Li; Chaoyou Fu; Xing Sun; Rongrong Ji

arXiv:2505.22396 · cs.CV · May 29, 2025

Open Access

TL;DR

This paper introduces CcDPO, a hierarchical preference optimization framework for Multi-modal Large Language Models (MLLMs). It improves multi-image understanding and reduces hallucinations by optimizing preferences at two levels: global context and local visual details.

Contribution

The paper presents a novel multi-level preference optimization method, CcDPO, that enhances multi-image perception in MLLMs by integrating global context and local visual cues, and introduces the MultiScope-42k dataset.

Findings

01. Significantly reduces hallucinations in multi-image tasks.

02. Achieves consistent performance improvements across various tasks.

03. Enhances perception through multi-level preference optimization.

Abstract

Multi-modal Large Language Models (MLLMs) excel at single-image tasks but struggle with multi-image understanding due to cross-modal misalignment, leading to hallucinations (context omission, conflation, and misinterpretation). Existing methods using Direct Preference Optimization (DPO) constrain optimization to a solitary image reference within the input sequence, neglecting holistic context modeling. We propose Context-to-Cue Direct Preference Optimization (CcDPO), a multi-level preference optimization framework that enhances per-image perception in multi-image settings by zooming into visual clues, from sequential context to local details. It features: (i) Context-Level Optimization: Re-evaluates cognitive biases underlying MLLMs' multi-image context comprehension and integrates a spectrum of low-cost global sequence preferences for bias mitigation. (ii) Needle-Level Optimization…
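
For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO loss for one preference pair, computed from sequence log-probabilities under the policy and a frozen reference model. The `two_level_loss` function is a hypothetical illustration of how a context-level term and a needle-level term might be summed; its name, the `needle_weight` parameter, and the additive form are assumptions and are not taken from the paper, whose exact objective is not given in the truncated abstract.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def two_level_loss(context_pair, needle_pair, needle_weight=1.0, beta=0.1):
    """Hypothetical two-level objective (an assumption, not the paper's formula).

    Each pair is a 4-tuple in the argument order expected by dpo_loss.
    """
    context_term = dpo_loss(*context_pair, beta=beta)
    needle_term = dpo_loss(*needle_pair, beta=beta)
    return context_term + needle_weight * needle_term
```

When the policy and reference agree on both responses, the loss equals log 2; it falls below log 2 as the policy raises the chosen response relative to the rejected one.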

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques