MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
Ziyu Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Haodong Duan, Conghui He, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

TL;DR
MIA-DPO introduces a cost-effective multi-image preference alignment method for large vision-language models, leveraging data augmentation and attention-based filtering to improve multi-image task performance without extra annotations.
Contribution
The paper proposes MIA-DPO, a novel approach that extends single-image data with unrelated images and uses attention values for filtering, enabling effective multi-image preference alignment without human annotations or external models.
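The attention-based filtering idea can be illustrated with a small sketch. This summary does not spell out the exact rule, so the following is an assumption: a response is flagged as a likely "rejected" candidate when too little of the model's attention falls on the image the question is actually about. The function names, the per-image attention list, and the default `threshold` are all hypothetical and are not taken from the paper.

```python
def target_attention_ratio(attn_per_image, target_idx):
    """Share of total image attention that lands on the target image.

    attn_per_image: non-negative attention mass per input image
                    (hypothetical input, e.g. summed over image tokens).
    target_idx:     index of the image the question refers to.
    """
    total = sum(attn_per_image)
    if total <= 0:
        return 0.0  # no usable attention signal
    return attn_per_image[target_idx] / total


def is_rejected_candidate(attn_per_image, target_idx, threshold=0.5):
    """Flag a response whose attention is mostly on the unrelated images.

    threshold is an illustrative value, not one reported in this summary.
    """
    return target_attention_ratio(attn_per_image, target_idx) < threshold


# The model attends mostly to image 0, but the question is about image 1.
print(is_rejected_candidate([0.6, 0.3, 0.1], target_idx=1))
```

In this reading, responses flagged by the filter would serve as the rejected side of a chosen/rejected pair, which is how the method could avoid human annotation or an external judge model.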
Findings
Outperforms existing methods on five multi-image benchmarks.
Achieves an average performance boost of 3.0% on LLaVA-v1.5.
Achieves an average performance boost of 4.3% on InternLM-XC2.5.
Abstract
Visual preference alignment involves training Large Vision-Language Models (LVLMs) to predict human preferences between visual inputs. This is typically achieved by using labeled datasets of chosen/rejected pairs and employing optimization algorithms like direct preference optimization (DPO). Existing visual alignment methods, primarily designed for single-image scenarios, struggle to effectively handle the complexity of multi-image tasks due to the scarcity of diverse training data and the high cost of annotating chosen/rejected pairs. We present Multi-Image Augmented Direct Preference Optimization (MIA-DPO), a visual preference alignment approach that effectively handles multi-image inputs. MIA-DPO mitigates the scarcity of diverse multi-image training data by extending single-image data with unrelated images arranged in grid collages or pic-in-pic formats, significantly reducing the…
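For readers unfamiliar with DPO, the standard per-pair objective can be sketched as follows. This is the generic DPO loss, not code from the MIA-DPO authors; the function name, argument names, and the default `beta` are assumptions chosen for illustration. Each argument is the summed log-probability of a full response under either the policy being trained or a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one chosen/rejected pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    beta is an illustrative default, not a value from the paper.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss shrinks as the policy prefers the chosen response more strongly
# than the reference model does.
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
print(dpo_loss(-12.0, -12.0, -12.0, -12.0))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2; training pushes the margin positive so that the chosen response becomes relatively more likely.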
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Constraint Satisfaction and Optimization
Methods: Softmax · Attention Is All You Need
