VILA$^2$: VILA Augmented VILA
Yunhao Fang, Ligeng Zhu, Yao Lu, Yan Wang, Pavlo Molchanov, Jan Kautz, Jang Hyun Cho, Marco Pavone, Song Han, Hongxu Yin

TL;DR
VILA$^2$ introduces an augmentation scheme for visual language models that combines a self-augment step with a specialist-augment step, improving data quality and model performance without human labeling and offering a cost-efficient way to enhance datasets.
Contribution
The paper presents a novel VLM augmentation method combining self- and specialist-augmentation, significantly enhancing data quality and model accuracy without human annotation.
Findings
Multiple self-augmentation rounds improve downstream accuracy.
Specialist skill finetuning enhances caption diversity.
Data quality improvements are validated by GPT-4V, Gemini, and humans.
Abstract
While visual language model architectures and training infrastructures advance rapidly, data curation remains under-explored where quantity and quality become a bottleneck. Existing work either crawls extra Internet data with a loose guarantee of quality or distills from black-box proprietary models, e.g., GPT-4V / Gemini that are API frequency and performance bounded. This work enables a VLM to improve itself via data enhancement, exploiting its generative nature. We introduce a simple yet effective VLM augmentation scheme that includes a self-augment step and a specialist-augment step to iteratively improve data quality and hence, model performance. In the self-augment step, the instruction-finetuned VLM recaptions its pretraining caption datasets and then retrains from scratch leveraging refined data. Without any expensive human-in-the-loop annotation, we observe improvements in data…
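The self-augment step described in the abstract (train a VLM, have it recaption its own pretraining captions, then retrain from scratch on the refined captions) can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `self_augment`, `toy_train`, and the `Captions` type are hypothetical names, and the toy "model" just appends a token to each caption so the loop can run without any real training. The specialist-augment step is not sketched because the abstract text above is truncated before describing it.

```python
from typing import Callable, Dict, List

# Hypothetical data layout: image id -> caption text.
Captions = Dict[str, str]
# Hypothetical model interface: (image id, old caption) -> new caption.
Recaptioner = Callable[[str, str], str]


def self_augment(
    captions: Captions,
    train_from_scratch: Callable[[Captions], Recaptioner],
    rounds: int = 3,
) -> List[Captions]:
    """Run `rounds` train-then-recaption cycles.

    Returns every caption set produced, starting with the original one.
    """
    history = [dict(captions)]
    for _ in range(rounds):
        # Each round trains a fresh model on the latest captions ...
        model = train_from_scratch(history[-1])
        # ... and uses that model to rewrite the same caption set.
        refined = {img: model(img, cap) for img, cap in history[-1].items()}
        history.append(refined)
    return history


def toy_train(data: Captions) -> Recaptioner:
    """Stand-in for real VLM training (assumption, for illustration only).

    The returned 'model' appends a marker derived from the average
    caption length it was 'trained' on.
    """
    avg_len = sum(len(c.split()) for c in data.values()) / len(data)

    def recaption(image_id: str, old_caption: str) -> str:
        return f"{old_caption} detail{int(avg_len)}"

    return recaption
```

Calling `self_augment({"img0": "a dog"}, toy_train, rounds=2)` returns three caption sets: the original plus one per round. In a real pipeline, `train_from_scratch` would wrap pretraining and instruction finetuning, and the recaptioner would condition on the image itself rather than only on the previous caption.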
