PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter

Junfei Xiao; Zheng Xu; Alan Yuille; Shen Yan; Boyu Wang

arXiv:2402.10896·cs.CV·June 4, 2024·3 cites

TL;DR

This paper introduces PaLM2-VAdapter, a progressively aligned language model that effectively bridges vision encoders and LLMs, achieving state-of-the-art performance with fewer parameters and faster convergence in vision-language tasks.

Contribution

The paper proposes using a progressively aligned language model as the vision-language adapter, improving convergence speed, scalability, and efficiency over methods based on the perceiver resampler.

Findings

01 Faster convergence than the perceiver resampler baseline

02 Higher performance on VQA and captioning tasks

03 State-of-the-art results with 30-70% fewer parameters than state-of-the-art large vision-language models

Abstract

This paper demonstrates that a progressively aligned language model can effectively bridge frozen vision encoders and large language models (LLMs). While the fundamental architecture and pre-training methods of vision encoders and LLMs have been extensively studied, the architecture and training strategy of vision-language adapters vary significantly across recent works. Our research undertakes a thorough exploration of the state-of-the-art perceiver resampler architecture and builds a strong baseline. However, we observe that the vision-language alignment with perceiver resampler exhibits slow convergence and limited scalability with a lack of direct supervision. To address this issue, we propose PaLM2-VAdapter, employing a progressively aligned language model as the vision-language adapter. Compared to the strong baseline with perceiver resampler, our method empirically shows faster…
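
For readers unfamiliar with the perceiver resampler baseline mentioned above, the following is a minimal sketch of its core idea: a small, fixed set of learned latent vectors cross-attends to the visual features, so the adapter emits the same number of tokens regardless of how many patches the vision encoder produces. This is not the paper's implementation. The function names (`resample`, `matmul`, `rand_matrix`), the single attention head, the random weights, and all sizes are illustrative assumptions, and real resamplers stack several such layers with feed-forward blocks.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def resample(visual_feats, latents, w_q, w_k, w_v):
    """One single-head cross-attention step (simplified, hypothetical).

    visual_feats: num_patches x dim, latents: num_latents x dim.
    Returns num_latents x dim, independent of num_patches.
    """
    q = matmul(latents, w_q)        # queries come from the latents
    k = matmul(visual_feats, w_k)   # keys come from the visual features
    v = matmul(visual_feats, w_v)   # values come from the visual features
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    attn = [softmax([s / scale for s in row]) for row in scores]
    mixed = matmul(attn, v)
    # Residual connection: each latent is updated with attended visual content.
    return [[l + m for l, m in zip(l_row, m_row)]
            for l_row, m_row in zip(latents, mixed)]


def rand_matrix(rows, cols, rng):
    """Random Gaussian matrix standing in for learned parameters (assumption)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, num_patches, num_latents = 16, 49, 8   # illustrative sizes only
visual_feats = rand_matrix(num_patches, dim, rng)
latents = rand_matrix(num_latents, dim, rng)
w_q, w_k, w_v = (rand_matrix(dim, dim, rng) for _ in range(3))

out = resample(visual_feats, latents, w_q, w_k, w_v)
print(len(out), len(out[0]))  # 8 16
```

The output always has `num_latents` rows, which is what lets a resampler compress a variable number of visual tokens into a fixed-length prefix for the LLM. In this setup the only training signal reaches the latents through the downstream language model, which is the "lack of direct supervision" the abstract refers to.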

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling