Bootstrapping Vision-Language Learning with Decoupled Language Pre-training

Yiren Jian; Chongyang Gao; Soroush Vosoughi

arXiv:2307.07063 · cs.CV · December 21, 2023 · 5 cites

TL;DR

This paper introduces a prompt-prediction approach for vision-language pre-training. A prompt predictor trained solely on linguistic data supplies target prompts for the visual features, which improves performance and reduces data requirements.

Contribution

It proposes the Prompt-Transformer (P-Former), a model trained only on language data that predicts the optimal prompts for aligning visual features with text, making VL pre-training more efficient.

Findings

01. Significantly improves the performance of a robust image-to-text baseline (BLIP-2).

02. Narrows the performance gap between models trained on different data scales.

03. Is modality-agnostic: the approach also applies to video learning.

Abstract

We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language (VL) pre-training. The current paradigm uses visual features as prompts to guide language models, with a focus on determining the most relevant visual features for corresponding text. Our approach diverges by concentrating on the language component, specifically identifying the optimal prompts to align with visual features. We introduce the Prompt-Transformer (P-Former), a model that predicts these ideal prompts, which is trained exclusively on linguistic data, bypassing the need for image-text pairings. This strategy subtly bifurcates the end-to-end VL training process into an additional, separate stage. Our experiments reveal that our framework significantly enhances the performance of a robust image-to-text baseline (BLIP-2), and effectively…
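
The decoupled scheme described in the abstract can be illustrated with a small sketch. It assumes that the prompt predicted from text acts as a fixed target for the prompt computed from visual features, and that the two are pulled together with a mean-squared-error term added to the usual image-to-text loss. The abstract does not state the loss, so the MSE choice, the weighting, and every name below (`alignment_loss`, `total_loss`, `weight`) are hypothetical and not taken from the paper or its repository.

```python
# Hypothetical sketch: none of these names come from the paper or its repository.
from typing import Sequence


def alignment_loss(visual_prompt: Sequence[float], target_prompt: Sequence[float]) -> float:
    """Mean squared error between a prompt built from visual features and the
    prompt predicted from the paired text (MSE is an assumption here)."""
    if len(visual_prompt) != len(target_prompt):
        raise ValueError("prompts must have the same length")
    if not visual_prompt:
        raise ValueError("prompts must be non-empty")
    squared = [(v - t) ** 2 for v, t in zip(visual_prompt, target_prompt)]
    return sum(squared) / len(squared)


def total_loss(captioning_loss: float, align: float, weight: float = 1.0) -> float:
    """Add the weighted alignment term to the usual image-to-text loss."""
    return captioning_loss + weight * align


# Stage 1 (not shown): a prompt predictor is trained on text alone, then frozen.
# Stage 2: its output for the caption is a fixed target for the visual prompt.
target = [0.5, -1.0, 2.0]  # stand-in for the frozen predictor's output
visual = [0.0, -1.0, 1.0]  # stand-in for the prompt computed from image features
loss = total_loss(captioning_loss=2.0, align=alignment_loss(visual, target), weight=0.5)
```

Because the target comes from a model trained without images, the first stage in this sketch needs no image-text pairs; only the second stage does.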

Code & Models

Repositories

yiren-jian/blitext (PyTorch, official)

Videos

Bootstrapping Vision-Language Learning with Decoupled Language Pre-training · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Focus · Balanced Selection · ALIGN