Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models

Tiezheng Zhang; Yitong Li; Yu-cheng Chou; Jieneng Chen; Alan Yuille; Chen Wei; Junfei Xiao

arXiv:2507.07104 · cs.CV · July 14, 2025

TL;DR

This paper presents a scalable, cost-efficient framework for training vision-language models by distilling knowledge from pretrained diffusion and language models, reducing data needs and computational costs.

Contribution

Introduces the VLV auto-encoder framework, which combines pretrained vision, diffusion, and language models for efficient knowledge distillation and captioning.

Findings

01. Achieves state-of-the-art captioning performance comparable to GPT-4o and Gemini 2.0 Flash.

02. Reduces training costs to under $1,000 USD by using existing pretrained models.

03. Effectively distills semantic understanding from diffusion models using continuous embeddings.

Abstract

Building state-of-the-art Vision-Language Models (VLMs) with strong captioning capabilities typically necessitates training on billions of high-quality image-text pairs, requiring millions of GPU hours. This paper introduces the Vision-Language-Vision (VLV) auto-encoder framework, which strategically leverages key pretrained components: a vision encoder, the decoder of a Text-to-Image (T2I) diffusion model, and subsequently, a Large Language Model (LLM). Specifically, we establish an information bottleneck by regularizing the language representation space, achieved through freezing the pretrained T2I diffusion decoder. Our VLV pipeline effectively distills knowledge from the text-conditioned diffusion model using continuous embeddings, demonstrating comprehensive semantic understanding via high-quality reconstructions. Furthermore, by fine-tuning a pretrained LLM to decode the…
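
The core idea in the abstract, training an encoder through a frozen decoder so that its continuous embedding must carry everything needed for reconstruction, can be illustrated with a deliberately tiny stand-in. The sketch below is not the paper's implementation: it assumes a linear encoder and a fixed linear "decoder" in place of the vision encoder and the T2I diffusion decoder, and it uses plain mean squared reconstruction error rather than a diffusion training objective. All names (`train_encoder`, `DECODER`, `emb_dim`) are hypothetical.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def train_encoder(images, decoder, emb_dim, steps=200, lr=0.05, seed=0):
    """Fit a linear encoder so that decoder(encoder(x)) approximates x.

    The decoder is treated as frozen: gradients flow through it,
    but its weights are never updated (toy analogue of a frozen
    pretrained decoder).
    """
    rng = random.Random(seed)
    img_dim = len(images[0])
    encoder = [[rng.uniform(-0.1, 0.1) for _ in range(img_dim)]
               for _ in range(emb_dim)]
    decoder_t = transpose(decoder)
    n = len(images)
    losses = []
    for _ in range(steps):
        total = 0.0
        grad = [[0.0] * img_dim for _ in range(emb_dim)]
        for x in images:
            z = matvec(encoder, x)        # continuous embedding (the bottleneck)
            x_hat = matvec(decoder, z)    # reconstruction from the frozen decoder
            residual = [p - t for p, t in zip(x_hat, x)]
            total += sum(e * e for e in residual)
            # dL/dz = 2 * D^T r, so dL/dE = 2 * (D^T r) x^T
            back = matvec(decoder_t, residual)
            for i in range(emb_dim):
                for j in range(img_dim):
                    grad[i][j] += 2.0 * back[i] * x[j]
        # Update only the encoder; the decoder stays untouched.
        for i in range(emb_dim):
            for j in range(img_dim):
                encoder[i][j] -= lr * grad[i][j] / n
        losses.append(total / n)
    return encoder, losses


# Hypothetical frozen decoder: maps a 2-d embedding to a 4-d "image".
DECODER = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
latents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
images = [matvec(DECODER, z) for z in latents]   # data the decoder can express
decoder_before = [row[:] for row in DECODER]

encoder, losses = train_encoder(images, DECODER, emb_dim=2)
print(f"reconstruction loss: {losses[0]:.3f} -> {losses[-1]:.2e}")
```

Because only the encoder is updated, it has to learn embeddings in the space the fixed decoder already understands. This is a toy analogue of the regularization the abstract attributes to freezing the T2I diffusion decoder; the later stage, in which an LLM decodes those embeddings, is not modeled here.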

Code & Models

Datasets

ccvl/LAION-High-Qualtiy-Pro-6M-VLV
dataset · 2.0k downloads
