VILA: On Pre-training for Visual Language Models

Ji Lin; Hongxu Yin; Wei Ping; Yao Lu; Pavlo Molchanov; Andrew Tao; Huizi Mao; Jan Kautz; Mohammad Shoeybi; Song Han

arXiv:2312.07533·cs.CV·May 20, 2024·6 cites


Open Access · 3 Repos · 10 Models

TL;DR

This paper systematically studies visual language model pre-training, revealing key design choices that improve performance, and introduces VILA, a new model that outperforms existing state-of-the-art models across benchmarks.

Contribution

It provides an in-depth analysis of VLM pre-training strategies and proposes an effective recipe leading to the development of VILA, a superior visual language model.

Findings

01

Freezing the LLM during pre-training can achieve decent zero-shot performance but lacks in-context learning capability, which requires unfreezing the LLM.

02

Interleaved pre-training data is beneficial; image-text pairs alone are not optimal.

03

Re-blending text-only instruction data into image-text data during instruction fine-tuning enhances task performance.

Abstract

Visual language models (VLMs) have progressed rapidly with the recent success of large language models. There have been growing efforts on visual instruction tuning to extend the LLM with visual inputs, but the field lacks an in-depth study of the visual language pre-training process, where the model learns to perform joint modeling on both modalities. In this work, we examine the design options for VLM pre-training by augmenting an LLM towards a VLM through step-by-step controllable comparisons. We introduce three main findings: (1) freezing LLMs during pre-training can achieve decent zero-shot performance, but lacks in-context learning capability, which requires unfreezing the LLM; (2) interleaved pre-training data is beneficial, whereas image-text pairs alone are not optimal; (3) re-blending text-only instruction data into image-text data during instruction fine-tuning not only remedies the degradation of…
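
Finding (3) in the abstract describes re-blending text-only instruction data into the image-text data used for instruction fine-tuning. A minimal sketch of what such a blend could look like is given here in plain Python. The function name `blend_instruction_data`, the `text_fraction` parameter, and the sample dictionaries are hypothetical illustrations, not the authors' code, and the mixing ratio used in the paper is not stated on this page.

```python
import random
from typing import Dict, List


def blend_instruction_data(
    image_text: List[Dict],
    text_only: List[Dict],
    text_fraction: float,
    seed: int = 0,
) -> List[Dict]:
    """Mix text-only samples into image-text samples (hypothetical helper).

    Roughly `text_fraction` of the returned samples are text-only, capped
    by how many text-only samples are available.
    """
    if not 0.0 <= text_fraction < 1.0:
        raise ValueError("text_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Text-only count needed so that it forms `text_fraction` of the total.
    n_text = round(len(image_text) * text_fraction / (1.0 - text_fraction))
    n_text = min(n_text, len(text_only))
    # Keep every image-text sample, add a random subset of text-only ones.
    mixed = list(image_text) + rng.sample(text_only, n_text)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Toy data; the 10% ratio is an assumption chosen for illustration only.
    img = [{"image": f"img_{i}.jpg", "text": "caption"} for i in range(90)]
    txt = [{"text": f"instruction {i}"} for i in range(50)]
    mixed = blend_instruction_data(img, txt, text_fraction=0.1)
    print(len(mixed), sum("image" not in s for s in mixed))
```

The sketch keeps all image-text samples and only subsamples the text-only set, so the visual instruction data is never reduced. Other blending schemes, such as per-batch sampling, would be equally consistent with the description above.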



Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling