Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement

Xiyao Wang; Jiuhai Chen; Zhaoyang Wang; Yuhang Zhou; Yiyang Zhou; Huaxiu Yao; Tianyi Zhou; Tom Goldstein; Parminder Bhatia; Furong Huang; Cao Xiao

arXiv:2405.15973 · cs.CV · February 11, 2025 · 3 cites

TL;DR

This paper introduces SIMA, a self-improvement framework for large vision-language models that enhances modality alignment using self-generated responses and internal critic mechanisms, without external data dependencies.

Contribution

SIMA is a novel self-improvement approach that improves LVLMs' modality alignment by leveraging existing datasets and internal critics, avoiding external models or data.

Findings

01 Significant performance improvements on 14 benchmarks.

02 Outperforms previous methods in modality alignment.

03 Effective self-critic with new visual metrics.

Abstract

Large vision-language models (LVLMs) have achieved impressive results in visual question-answering and reasoning tasks through vision instruction tuning on specific datasets. However, there remains significant room for improvement in aligning visual and language modalities. Existing methods often depend on external models or data, leading to uncontrollable and unstable alignment results. In this paper, we propose SIMA, a self-improvement framework that enhances visual and language modality alignment without external dependencies. SIMA leverages existing vision instruction tuning datasets to self-generate responses, incorporating an in-context self-critic mechanism that constructs preference pairs for tuning. Crucially, our approach allows LVLMs to act as critics by designing effective critic prompts, eliminating the need for additional fine-tuning with external instruction data. We…

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques