EVLF: Early Vision-Language Fusion for Generative Dataset Distillation

Wenqi Cai; Yawen Zou; Guang Li; Chunzhi Gu; Chao Zhang

arXiv:2603.07476 · cs.CV · March 10, 2026

Open Access

TL;DR

The paper introduces EVLF, an early fusion method for diffusion-based dataset distillation that improves the quality of synthetic data by aligning visual and textual embeddings early, leading to better downstream classification performance.

Contribution

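EVLF is a plug-and-play early fusion approach for diffusion-based dataset distillation. It aligns visual and textual embeddings through a lightweight cross-attention module placed at the transition between the encoder and the generative backbone, rather than relying only on late-stage cross-attention, where textual prompts tend to dominate.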

Findings

01. EVLF produces more semantically faithful synthetic data.

02. EVLF improves downstream classification accuracy.

03. EVLF is compatible with various architectures and schedules.

Abstract

Dataset distillation (DD) aims to synthesize compact training sets that enable models to achieve high accuracy with significantly fewer samples. Recent diffusion-based DD methods commonly introduce semantic guidance through late-stage cross-attention, where textual prompts tend to dominate the generative process. Although this strategy enforces label relevance, it diminishes the contribution of visual latents, resulting in over-corrected samples that mirror prompt patterns rather than reflecting intrinsic visual features. To solve this problem, we introduce an Early Vision-Language Fusion (EVLF) method that aligns textual and visual embeddings at the transition between the encoder and the generative backbone. By incorporating a lightweight cross-attention module at this transition, the early representations simultaneously encode local textures and global semantic directions across the…
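
To make the idea of an early cross-attention fusion step concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract above is truncated and does not specify the module's details, so the choices below are assumptions made for illustration only. In particular, the sketch assumes that queries come from the visual latents, that keys and values come from the text embeddings, that projections are identity maps, and that the attended text context is added back to the visual latents through a residual connection. The names `early_fuse` and `cross_attention_weights` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def cross_attention_weights(visual, text):
    """Scaled dot-product attention of visual tokens over text tokens.

    Assumption: identity query/key projections, single head.
    """
    d = len(visual[0])
    scores = matmul(visual, transpose(text))
    return [softmax([s / math.sqrt(d) for s in row]) for row in scores]


def early_fuse(visual, text):
    """Fuse text context into visual latents before the generative backbone.

    Assumption: residual fusion, so the visual latents are kept and the
    attended text context is added on top of them.
    """
    weights = cross_attention_weights(visual, text)
    context = matmul(weights, text)  # identity value projection (assumed)
    return [[v + c for v, c in zip(v_row, c_row)]
            for v_row, c_row in zip(visual, context)]


# Toy inputs: three visual latent tokens and two text tokens, dimension 2.
visual_latents = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text_embeddings = [[2.0, 0.0], [0.0, 2.0]]

weights = cross_attention_weights(visual_latents, text_embeddings)
fused = early_fuse(visual_latents, text_embeddings)
```

In this toy setup the fused tensor keeps the shape of the visual latents, so a downstream backbone could consume it without other changes, which is one plausible reading of "plug-and-play". The residual form is a design choice of the sketch: it keeps the visual signal present in the fused representation instead of letting the text context replace it.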

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Face recognition and analysis