Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
Yiren Jian, Chongyang Gao, Soroush Vosoughi

TL;DR
This paper introduces a prompt-prediction approach for vision-language pre-training that improves performance and reduces data requirements by focusing on the language side: the prompt predictor is trained solely on linguistic data.
Contribution
It proposes the Prompt-Transformer (P-Former), a language-only trained model that predicts optimal prompts to align visual features with text, enhancing VL pre-training efficiency.
Findings
Significantly improves the performance of a strong image-to-text baseline (BLIP-2).
Reduces the performance gap between models trained on different data scales.
Demonstrates modality-agnostic applicability, including video learning.
Abstract
We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language (VL) pre-training. The current paradigm uses visual features as prompts to guide language models, with a focus on determining the most relevant visual features for corresponding text. Our approach diverges by concentrating on the language component, specifically identifying the optimal prompts to align with visual features. We introduce the Prompt-Transformer (P-Former), a model that predicts these ideal prompts, which is trained exclusively on linguistic data, bypassing the need for image-text pairings. This strategy subtly bifurcates the end-to-end VL training process into an additional, separate stage. Our experiments reveal that our framework significantly enhances the performance of a robust image-to-text baseline (BLIP-2), and effectively…
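One way to picture the decoupling described in the abstract is as an extra alignment term in the vision-language objective: the prompts produced from visual features are pulled toward the prompts that a frozen, language-only predictor outputs for the paired text. The sketch below is illustrative only. The summary above does not state the loss, so the mean-squared-error form, the `weight` parameter, and the function names `mse` and `combined_loss` are all assumptions rather than the authors' implementation.

```python
from typing import List

Vector = List[float]


def mse(a: List[Vector], b: List[Vector]) -> float:
    """Mean squared error between two equally shaped lists of prompt vectors."""
    if len(a) != len(b) or any(len(x) != len(y) for x, y in zip(a, b)):
        raise ValueError("prompt shapes must match")
    total = sum((x - y) ** 2 for va, vb in zip(a, b) for x, y in zip(va, vb))
    count = sum(len(va) for va in a)
    return total / count


def combined_loss(
    caption_loss: float,
    visual_prompts: List[Vector],
    predicted_prompts: List[Vector],
    weight: float = 1.0,
) -> float:
    """Hypothetical objective: text-generation loss plus a prompt-alignment term.

    visual_prompts    -- prompts derived from image features (trainable side)
    predicted_prompts -- targets from a frozen, language-only prompt predictor
    weight            -- assumed hyperparameter balancing the two terms
    """
    return caption_loss + weight * mse(visual_prompts, predicted_prompts)


# Toy values, not real model outputs.
visual = [[1.0, 2.0], [3.0, 4.0]]
predicted = [[1.0, 2.0], [3.0, 6.0]]
loss = combined_loss(0.5, visual, predicted, weight=0.1)
```

Because the target prompts come from a predictor trained on text alone, a term of this kind needs no additional image-text pairs beyond those already used for the generation loss, which is the property the abstract emphasizes.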
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Focus · Balanced Selection · ALIGN
