EVLF: Early Vision-Language Fusion for Generative Dataset Distillation
Wenqi Cai, Yawen Zou, Guang Li, Chunzhi Gu, Chao Zhang

TL;DR
The paper introduces EVLF, an early fusion method for diffusion-based dataset distillation that improves the quality of synthetic data by aligning visual and textual embeddings early, leading to better downstream classification performance.
Contribution
EVLF is a plug-and-play early fusion approach that enhances diffusion-based dataset distillation by integrating visual and textual information at an early stage, improving data quality.
Findings
EVLF produces more semantically faithful synthetic data.
EVLF improves downstream classification accuracy.
EVLF is compatible with various architectures and schedules.
Abstract
Dataset distillation (DD) aims to synthesize compact training sets that enable models to achieve high accuracy with significantly fewer samples. Recent diffusion-based DD methods commonly introduce semantic guidance through late-stage cross-attention, where textual prompts tend to dominate the generative process. Although this strategy enforces label relevance, it diminishes the contribution of visual latents, resulting in over-corrected samples that mirror prompt patterns rather than reflecting intrinsic visual features. To solve this problem, we introduce an Early Vision-Language Fusion (EVLF) method that aligns textual and visual embeddings at the transition between the encoder and the generative backbone. By incorporating a lightweight cross-attention module at this transition, the early representations simultaneously encode local textures and global semantic directions across the…
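As a rough illustration of the early-fusion idea described in the abstract, the sketch below applies one single-head cross-attention step to encoder latents before they would reach a generative backbone. It is not the authors' implementation. The choice of visual tokens as queries and text tokens as keys and values, the residual connection, the dimensions, and every name (`early_fusion`, `w_q`, `w_k`, `w_v`) are assumptions made for illustration only.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.1):
    """Gaussian-initialised matrix; stands in for learned weights or embeddings."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def early_fusion(visual, text, w_q, w_k, w_v):
    """Hypothetical early-fusion step (assumed design, not taken from the paper).

    visual: n_v x d encoder latents, used as queries.
    text:   n_t x d prompt embeddings, used as keys and values.
    Returns the fused n_v x d latents and the n_v x n_t attention weights.
    """
    q = matmul(visual, w_q)
    k = matmul(text, w_k)
    v = matmul(text, w_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    context = matmul(attn, v)
    # Assumed residual connection: the visual latents are kept and the
    # text-derived context is added on top of them.
    fused = [[a + b for a, b in zip(v_row, c_row)]
             for v_row, c_row in zip(visual, context)]
    return fused, attn


rng = random.Random(0)
d_model, n_visual, n_text = 8, 4, 3  # toy sizes, chosen arbitrarily
visual = random_matrix(n_visual, d_model, rng, scale=1.0)
text = random_matrix(n_text, d_model, rng, scale=1.0)
w_q, w_k, w_v = (random_matrix(d_model, d_model, rng) for _ in range(3))

fused, attn = early_fusion(visual, text, w_q, w_k, w_v)
print(len(fused), len(fused[0]))  # fused latents keep the visual shape
```

In this sketch the fused latents keep the shape of the visual latents, so a module of this kind could in principle sit between an encoder and a backbone without changing either interface. Whether EVLF uses a residual path or this query/key assignment is not stated in the text above.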
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Face recognition and analysis
