MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models

Ziyu Liu; Yuhang Zang; Xiaoyi Dong; Pan Zhang; Yuhang Cao; Haodong Duan; Conghui He; Yuanjun Xiong; Dahua Lin; Jiaqi Wang

arXiv:2410.17637·cs.CV·October 24, 2024

TL;DR

MIA-DPO introduces a cost-effective multi-image preference alignment method for large vision-language models, leveraging data augmentation and attention-based filtering to improve multi-image task performance without extra annotations.

Contribution

The paper proposes MIA-DPO, a novel approach that extends single-image data with unrelated images and uses attention values for filtering, enabling effective multi-image preference alignment without human annotations or external models.
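The attention-based filtering idea can be illustrated with a small sketch. It assumes the attention mass that a response places on each input image has already been aggregated into one number per image, and it treats a response as a candidate rejected sample when too little of that mass lands on the image the question is actually about. The function names, the aggregation, and the 0.5 threshold are hypothetical choices made for illustration; the summary above does not specify them.

```python
from typing import List


def attention_ratio(attn_per_image: List[float], target_index: int) -> float:
    """Fraction of image-directed attention mass that lands on the target image.

    attn_per_image is assumed to hold one non-negative, pre-aggregated
    attention value per input image (hypothetical input format).
    """
    total = sum(attn_per_image)
    if total <= 0:
        return 0.0
    return attn_per_image[target_index] / total


def is_rejected_candidate(
    attn_per_image: List[float],
    target_index: int,
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> bool:
    """Flag a response whose attention mostly went to the unrelated images."""
    return attention_ratio(attn_per_image, target_index) < threshold


if __name__ == "__main__":
    # Three images; the question concerns image 1, but attention sits on image 0.
    print(is_rejected_candidate([8.0, 1.0, 1.0], target_index=1))
    # Same images, attention concentrated on the target image.
    print(is_rejected_candidate([1.0, 8.0, 1.0], target_index=1))
```

A filter of this shape needs only the model's own attention values, which is consistent with the claim that no human annotations or external models are required.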

Findings

01. Outperforms existing methods on five multi-image benchmarks.

02. Achieves an average performance boost of 3.0% on LLaVA-v1.5.

03. Achieves an average performance boost of 4.3% on InternLM-XC2.5.

Abstract

Visual preference alignment involves training Large Vision-Language Models (LVLMs) to predict human preferences between visual inputs. This is typically achieved by using labeled datasets of chosen/rejected pairs and employing optimization algorithms like direct preference optimization (DPO). Existing visual alignment methods, primarily designed for single-image scenarios, struggle to effectively handle the complexity of multi-image tasks due to the scarcity of diverse training data and the high cost of annotating chosen/rejected pairs. We present Multi-Image Augmented Direct Preference Optimization (MIA-DPO), a visual preference alignment approach that effectively handles multi-image inputs. MIA-DPO mitigates the scarcity of diverse multi-image training data by extending single-image data with unrelated images arranged in grid collages or pic-in-pic formats, significantly reducing the…
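The two layouts named in the abstract, grid collages and pic-in-pic, can be sketched in a few lines. This is a minimal illustration rather than the authors' pipeline: images are represented as plain nested lists of grayscale pixel values, and `grid_collage` and `pic_in_pic` are hypothetical helper names. It assumes all tiles in a collage share the same size.

```python
from typing import List

Image = List[List[int]]  # grayscale image stored as rows of pixel values


def grid_collage(images: List[Image], cols: int) -> Image:
    """Tile equally sized images, row-major, into a grid with `cols` columns."""
    if cols <= 0 or not images or len(images) % cols != 0:
        raise ValueError("need a non-empty image list that fills complete rows")
    height = len(images[0])
    collage: Image = []
    for start in range(0, len(images), cols):
        row_images = images[start:start + cols]
        for y in range(height):
            line: List[int] = []
            for img in row_images:
                line.extend(img[y])  # place tiles side by side
            collage.append(line)
    return collage


def pic_in_pic(background: Image, inset: Image, top: int, left: int) -> Image:
    """Return a copy of `background` with `inset` pasted at (top, left)."""
    if top < 0 or left < 0:
        raise ValueError("offsets must be non-negative")
    if top + len(inset) > len(background) or left + len(inset[0]) > len(background[0]):
        raise ValueError("inset does not fit inside the background")
    out = [row[:] for row in background]  # leave the original untouched
    for dy, inset_row in enumerate(inset):
        out[top + dy][left:left + len(inset_row)] = inset_row
    return out


if __name__ == "__main__":
    target = [[1, 1], [1, 1]]      # the original single-image sample
    unrelated = [[2, 2], [2, 2]]   # an unrelated distractor image
    print(grid_collage([target, unrelated], cols=2))
    print(pic_in_pic([[0] * 4 for _ in range(4)], target, top=1, left=1))
```

In both layouts the original single-image question and answer can be reused unchanged, since the unrelated images only add distractors around the image the question refers to.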

Code & Models

Repositories

liuziyu77/mia-dpo (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Constraint Satisfaction and Optimization

Methods: Softmax · Attention Is All You Need