Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models
Tiezheng Zhang, Yitong Li, Yu-cheng Chou, Jieneng Chen, Alan Yuille, Chen Wei, Junfei Xiao

TL;DR
This paper presents a scalable, cost-efficient framework for training vision-language models by distilling knowledge from pretrained diffusion and language models, reducing data needs and computational costs.
Contribution
Introduces the VLV auto-encoder framework, which leverages a pretrained vision encoder, a T2I diffusion decoder, and an LLM for efficient knowledge distillation and captioning.
Findings
Achieves state-of-the-art captioning performance comparable to GPT-4o and Gemini 2.0 Flash.
Reduces training costs to under $1,000 USD by using existing pretrained models.
Effectively distills semantic understanding from diffusion models using continuous embeddings.
Abstract
Building state-of-the-art Vision-Language Models (VLMs) with strong captioning capabilities typically necessitates training on billions of high-quality image-text pairs, requiring millions of GPU hours. This paper introduces the Vision-Language-Vision (VLV) auto-encoder framework, which strategically leverages key pretrained components: a vision encoder, the decoder of a Text-to-Image (T2I) diffusion model, and subsequently, a Large Language Model (LLM). Specifically, we establish an information bottleneck by regularizing the language representation space, achieved through freezing the pretrained T2I diffusion decoder. Our VLV pipeline effectively distills knowledge from the text-conditioned diffusion model using continuous embeddings, demonstrating comprehensive semantic understanding via high-quality reconstructions. Furthermore, by fine-tuning a pretrained LLM to decode the…
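The frozen-decoder idea in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the 2x2 linear maps stand in for the vision encoder and the T2I diffusion decoder, a plain squared reconstruction error stands in for whatever training objective the paper uses, and the names `FROZEN_DECODER` and `train_encoder` are hypothetical. The only point it demonstrates is that gradients flow through a fixed decoder to update the encoder alone, so the encoder has to emit embeddings the decoder already understands.

```python
import random

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def reconstruction_loss(enc, dec, x):
    """Squared error between x and its reconstruction dec(enc(x))."""
    x_hat = matvec(dec, matvec(enc, x))
    return sum((a - b) ** 2 for a, b in zip(x_hat, x))

def train_encoder(enc, dec, data, lr=0.05, epochs=200):
    """SGD on the encoder only; `dec` is read but never written (frozen)."""
    dec_t = transpose(dec)
    for _ in range(epochs):
        for x in data:
            z = matvec(enc, x)                  # continuous embedding
            x_hat = matvec(dec, z)              # reconstruction via frozen decoder
            err = [a - b for a, b in zip(x_hat, x)]
            # Backpropagate through the frozen decoder: dL/dz = 2 * dec^T * err
            grad_z = [2.0 * g for g in matvec(dec_t, err)]
            # dL/d enc[i][j] = grad_z[i] * x[j]
            for i in range(len(enc)):
                for j in range(len(x)):
                    enc[i][j] -= lr * grad_z[i] * x[j]
    return enc

random.seed(0)
FROZEN_DECODER = [[1.0, 0.5], [0.0, 1.0]]       # toy stand-in, never updated
encoder = [[0.0, 0.0], [0.0, 0.0]]              # toy stand-in, trainable
data = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(16)]

def mean_loss(enc):
    return sum(reconstruction_loss(enc, FROZEN_DECODER, x) for x in data) / len(data)

loss_before = mean_loss(encoder)
train_encoder(encoder, FROZEN_DECODER, data)
loss_after = mean_loss(encoder)
print(f"mean reconstruction loss: {loss_before:.4f} -> {loss_after:.2e}")
```

In this toy setting the encoder converges toward the inverse of the fixed decoder. The second stage described in the abstract, fine-tuning an LLM to decode the learned embeddings, is not sketched here.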
