Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs
Xudong Li, Mengdan Zhang, Peixian Chen, Xiawu Zheng, Yan Zhang, Jingyuan Zheng, Yunhang Shen, Ke Li, Chaoyou Fu, Xing Sun, Rongrong Ji

TL;DR
This paper introduces CcDPO, a hierarchical preference optimization framework that improves multi-image understanding in Multi-modal Large Language Models by focusing on global context and local visual details, reducing hallucinations.
Contribution
The paper presents a novel multi-level preference optimization method, CcDPO, that enhances multi-image perception in MLLMs by integrating global context and local visual cues, and introduces the MultiScope-42k dataset.
Findings
Significantly reduces hallucinations in multi-image tasks.
Achieves consistent performance improvements across various tasks.
Enhances perception by multi-level preference optimization.
Abstract
Multi-modal Large Language Models (MLLMs) excel at single-image tasks but struggle with multi-image understanding due to cross-modal misalignment, leading to hallucinations (context omission, conflation, and misinterpretation). Existing methods using Direct Preference Optimization (DPO) constrain optimization to a solitary image reference within the input sequence, neglecting holistic context modeling. We propose Context-to-Cue Direct Preference Optimization (CcDPO), a multi-level preference optimization framework that enhances per-image perception in multi-image settings by zooming into visual clues -- from sequential context to local details. It features: (i) Context-Level Optimization : Re-evaluates cognitive biases underlying MLLMs' multi-image context comprehension and integrates a spectrum of low-cost global sequence preferences for bias mitigation. (ii) Needle-Level Optimization…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Image and Video Retrieval Techniques
