Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction

Yiren Jian; Tingkai Liu; Yunzhe Tao; Chunhui Zhang; Soroush Vosoughi; Hongxia Yang

arXiv:2310.03291·cs.CV·February 22, 2024

TL;DR

This paper presents $EVL_{Gen}$, a framework for pre-training vision-language models that cuts training time by gradually merging similar visual tokens, reaches comparable performance with less data, and adapts to video-conditioned tasks.

Contribution

Introduces a one-stage, single-loss training framework that speeds up vision-language model pre-training fivefold and reduces data requirements while maintaining performance.

Findings

01. Training speed increased by a factor of 5.
02. Models perform well with only 10% of the data.
03. Framework adapts to video-conditioned tasks.

Abstract

In this paper, we introduce $EVL_{Gen}$, a streamlined framework designed for the pre-training of visually conditioned language generation models with high computational demands, utilizing frozen pre-trained large language models (LLMs). The conventional approach in vision-language pre-training (VLP) typically involves a two-stage optimization process: an initial resource-intensive phase dedicated to general-purpose vision-language representation learning, focused on extracting and consolidating relevant visual features. This is followed by a subsequent phase that emphasizes end-to-end alignment between visual and linguistic modalities. Our novel one-stage, single-loss framework bypasses the computationally demanding first training stage by gradually merging similar visual tokens during training, while avoiding model collapse caused by single-stage training of BLIP-2 type…
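The abstract says only that similar visual tokens are merged gradually during training; it does not specify the procedure. As a rough illustration of the general idea, and not the authors' implementation, the sketch below reduces a token set a little at each layer by repeatedly averaging the most similar pair. The cosine similarity measure, the greedy pair selection, the size-weighted averaging, and all function and parameter names (`merge_most_similar`, `gradual_merge`, `merges_per_layer`) are assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_most_similar(tokens, sizes):
    """Merge the single most similar pair of tokens into one token.

    Hypothetical helper: the merged token is the size-weighted mean of the
    pair, so a token that already stands for several inputs counts more.
    """
    best = None
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            score = cosine(tokens[i], tokens[j])
            if best is None or score > best[0]:
                best = (score, i, j)
    _, i, j = best
    total = sizes[i] + sizes[j]
    merged = [
        (sizes[i] * x + sizes[j] * y) / total
        for x, y in zip(tokens[i], tokens[j])
    ]
    keep = [k for k in range(len(tokens)) if k not in (i, j)]
    new_tokens = [tokens[k] for k in keep] + [merged]
    new_sizes = [sizes[k] for k in keep] + [total]
    return new_tokens, new_sizes


def gradual_merge(tokens, num_layers, merges_per_layer):
    """Shrink the token set a little at every layer (assumed schedule)."""
    sizes = [1] * len(tokens)
    counts = [len(tokens)]
    for _ in range(num_layers):
        # A real model would apply a transformer block here before merging.
        for _ in range(merges_per_layer):
            if len(tokens) < 2:
                break
            tokens, sizes = merge_most_similar(tokens, sizes)
        counts.append(len(tokens))
    return tokens, sizes, counts


if __name__ == "__main__":
    toy_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0],
                  [0.1, 0.9], [1.0, 1.0], [-1.0, 0.0]]
    merged, sizes, counts = gradual_merge(toy_tokens, num_layers=2,
                                          merges_per_layer=1)
    print(counts)
```

With six toy tokens, two layers, and one merge per layer, the token count drops by one at each layer. Fewer tokens reaching later layers means less computation per step, which is the kind of redundancy reduction the abstract credits for the training speedup.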

Code & Models

Repositories

yiren-jian/evlgen (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings