VILA$^2$: VILA Augmented VILA

Yunhao Fang; Ligeng Zhu; Yao Lu; Yan Wang; Pavlo Molchanov; Jan Kautz; Jang Hyun Cho; Marco Pavone; Song Han; Hongxu Yin

arXiv:2407.17453 · cs.CV · November 4, 2024

TL;DR

VILA$^2$ introduces an augmentation scheme for visual language models that combines a self-augment step and a specialist-augment step, improving data quality and model performance without human labeling and making dataset enhancement cost-efficient.

Contribution

The paper presents a novel VLM augmentation method combining self- and specialist-augmentation, significantly enhancing data quality and model accuracy without human annotation.

Findings

01. Multiple self-augmentation rounds improve downstream accuracy.

02. Specialist skill finetuning enhances caption diversity.

03. Data quality improvements are validated by GPT-4V, Gemini, and humans.

Abstract

While visual language model architectures and training infrastructures advance rapidly, data curation remains under-explored, and data quantity and quality have become a bottleneck. Existing work either crawls extra Internet data with a loose guarantee of quality or distills from black-box proprietary models, e.g., GPT-4V / Gemini, which are bounded by API rate limits and model performance. This work enables a VLM to improve itself via data enhancement, exploiting its generative nature. We introduce a simple yet effective VLM augmentation scheme that includes a self-augment step and a specialist-augment step to iteratively improve data quality and, hence, model performance. In the self-augment step, the instruction-finetuned VLM recaptions its pretraining caption datasets and then retrains from scratch on the refined data. Without any expensive human-in-the-loop annotation, we observe improvements in data…
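The self-augment step described in the abstract (recaption the pretraining data with the current model, then retrain from scratch on the refined captions) can be read as a simple loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `self_augment`, `toy_train`, `Sample`, and `Captioner` are hypothetical, images are stood in for by ID strings, and "training" is replaced by a toy function so the control flow can run end to end. The number of rounds is left as a parameter because the truncated abstract does not state it.

```python
from typing import Callable, List, Tuple

# Hypothetical types for illustration: an image is an ID string, a caption is plain text.
Sample = Tuple[str, str]                 # (image_id, caption)
Captioner = Callable[[str, str], str]    # (image_id, old_caption) -> new caption


def self_augment(
    dataset: List[Sample],
    train_from_scratch: Callable[[List[Sample]], Captioner],
    rounds: int,
) -> Tuple[Captioner, List[Sample]]:
    """Alternate between recaptioning the data and retraining on the result."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    data = list(dataset)
    model = train_from_scratch(data)     # initial model, trained on the original captions
    for _ in range(rounds):
        # The current model rewrites every caption in its own pretraining set.
        data = [(image_id, model(image_id, caption)) for image_id, caption in data]
        # A fresh model is trained on the refined captions (no warm start).
        model = train_from_scratch(data)
    return model, data


def toy_train(data: List[Sample]) -> Captioner:
    """Toy stand-in for VLM training (assumption): the 'model' only remembers
    the mean caption length in words and appends a tag derived from it."""
    avg_words = sum(len(caption.split()) for _, caption in data) / len(data)

    def captioner(image_id: str, old_caption: str) -> str:
        return f"{old_caption} detail{int(avg_words)}"

    return captioner


if __name__ == "__main__":
    seed = [("img0", "a dog"), ("img1", "a cat")]
    final_model, refined = self_augment(seed, toy_train, rounds=2)
    for image_id, caption in refined:
        print(image_id, "->", caption)
```

In a real pipeline, `train_from_scratch` would stand for pretraining plus instruction finetuning of the VLM, and the captioner would condition on the image itself; both are abstracted away here so that only the recaption-then-retrain structure remains.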
