VILA: On Pre-training for Visual Language Models
Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, Song Han

TL;DR
This paper systematically studies visual language model pre-training, revealing key design choices that improve performance, and introduces VILA, a new model that outperforms existing state-of-the-art models across benchmarks.
Contribution
It provides an in-depth analysis of VLM pre-training strategies and proposes an effective recipe leading to the development of VILA, a superior visual language model.
Findings
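Freezing the LLM during pre-training gives decent zero-shot performance but weakens in-context learning, which requires unfreezing the LLM.
Interleaved image-text data improves pre-training effectiveness; image-text pairs alone are not optimal.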
Blending text-only and image-text data enhances task performance.
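The blending finding can be illustrated with a minimal sketch that mixes text-only samples into an image-text dataset before fine-tuning. This is not the authors' pipeline: the function name `blend_datasets`, the `(image, text)` tuple layout, and the `text_fraction` parameter are all hypothetical, and the paper's actual mixing ratio is not stated in this summary.

```python
import random


def blend_datasets(image_text, text_only, text_fraction, seed=0):
    """Return a shuffled mix in which about `text_fraction` of samples are text-only.

    Hypothetical helper for illustration. Every image-text sample is kept;
    text-only samples are drawn without replacement to reach the target share.
    """
    if not 0.0 <= text_fraction < 1.0:
        raise ValueError("text_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_text / (n_text + n_image) = text_fraction for n_text.
    n_text = round(text_fraction * len(image_text) / (1.0 - text_fraction))
    # Never request more text-only samples than are available.
    n_text = min(n_text, len(text_only))
    mixed = list(image_text) + rng.sample(list(text_only), n_text)
    rng.shuffle(mixed)
    return mixed


# Toy data: text-only samples carry None in the image slot (an assumption).
image_text = [("img%d.jpg" % i, "caption %d" % i) for i in range(80)]
text_only = [(None, "instruction %d" % i) for i in range(100)]
mixed = blend_datasets(image_text, text_only, text_fraction=0.2)
```

Fixing the seed keeps the mix reproducible across runs, which matters when comparing a blended run against an image-text-only baseline.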
Abstract
Visual language models (VLMs) have progressed rapidly with the recent success of large language models. There have been growing efforts on visual instruction tuning to extend the LLM with visual inputs, but an in-depth study of the visual language pre-training process, where the model learns to perform joint modeling on both modalities, is still lacking. In this work, we examine the design options for VLM pre-training by augmenting an LLM towards a VLM through step-by-step controllable comparisons. We introduce three main findings: (1) freezing LLMs during pre-training can achieve decent zero-shot performance but lacks in-context learning capability, which requires unfreezing the LLM; (2) interleaved pre-training data is beneficial, whereas image-text pairs alone are not optimal; (3) re-blending text-only instruction data into image-text data during instruction fine-tuning not only remedies the degradation of…
Code & Models
- 🤗 Efficient-Large-Model/VILA-13b (model) · 16 downloads · ♡ 20
- 🤗 Efficient-Large-Model/VILA-7b (model) · 53 downloads · ♡ 27
- 🤗 Efficient-Large-Model/VILA-7b-4bit-awq (model) · 18 downloads · ♡ 2
- 🤗 Efficient-Large-Model/VILA-13b-4bit-awq (model) · 7 downloads · ♡ 2
- 🤗 Efficient-Large-Model/VILA-2.7b (model) · 131 downloads · ♡ 15
- 🤗 Efficient-Large-Model/VILA1.5-3b (model) · 4.6k downloads · ♡ 33
- 🤗 Efficient-Large-Model/VILA1.5-13b (model) · 365 downloads · ♡ 5
- 🤗 Efficient-Large-Model/Llama-3-VILA1.5-8B (model) · 338 downloads · ♡ 37
- 🤗 Efficient-Large-Model/VILA1.5-40b (model) · 21 downloads · ♡ 17
- 🤗 Efficient-Large-Model/VILA1.5-3b-s2 (model) · 25 downloads · ♡ 2
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
