Vision-Language Dataset Distillation
Xindi Wu, Byron Zhang, Zhiwei Deng, Olga Russakovsky

TL;DR
This paper introduces the first vision-language dataset distillation method, which condenses a large-scale image-text dataset into a small synthetic set of pairs and substantially outperforms coreset selection on retrieval with the same number of training pairs.
Contribution
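It pioneers a vision-language dataset distillation approach that builds on trajectory matching. To address the lack of discrete classes in such datasets, it jointly distills image-text pairs in a contrastive formulation, and it uses LoRA matching to make trajectory matching more efficient in complex vision-language models.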
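As a rough illustration of trajectory matching restricted to low-rank (LoRA) factors, here is a minimal NumPy sketch. It is not the authors' implementation: the normalized squared-distance form follows the usual trajectory-matching objective, and the function names, matrix sizes, and rank are assumptions chosen for the example.

```python
import numpy as np

def trajectory_matching_loss(student_end, expert_start, expert_target):
    """Normalized squared distance between student and expert parameters.

    student_end:   parameters after a few steps on the distilled pairs,
                   starting from expert_start
    expert_start:  expert checkpoint the student was initialized from
    expert_target: a later checkpoint on the same expert trajectory
    """
    num = np.sum((student_end - expert_target) ** 2)
    den = np.sum((expert_start - expert_target) ** 2)
    return float(num / den)

def flatten_lora(A, B):
    """Concatenate the two low-rank factors into one parameter vector."""
    return np.concatenate([A.ravel(), B.ravel()])

# Hypothetical sizes: one 64x64 weight matrix adapted with rank 4.
rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 4

start = flatten_lora(rng.normal(size=(rank, d_in)),
                     rng.normal(size=(d_out, rank)))
target = flatten_lora(rng.normal(size=(rank, d_in)),
                      rng.normal(size=(d_out, rank)))

full_params = d_out * d_in   # parameters matched without LoRA
lora_params = start.size     # parameters matched with LoRA factors only

# A student that has not moved scores 1.0; one that lands on the
# expert target scores 0.0.
loss_unmoved = trajectory_matching_loss(start, start, target)
loss_matched = trajectory_matching_loss(target, start, target)
```

In this toy setting only 512 values per adapted matrix are matched instead of 4096, which is the kind of saving that makes storing and comparing expert trajectories cheaper.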
Findings
Retrieval performance improves significantly on the Flickr30K and COCO benchmarks.
Distillation with 100 pairs nearly doubles retrieval accuracy compared to coreset selection.
The method reduces training data size by an order of magnitude while maintaining high performance.
Abstract
Dataset distillation methods reduce large-scale datasets to smaller sets of synthetic data, preserving sufficient information to quickly train a new model from scratch. However, prior work on dataset distillation has focused exclusively on image classification datasets, whereas modern large-scale datasets are primarily vision-language datasets. In this work, we design the first vision-language dataset distillation method, building on the idea of trajectory matching. A key challenge is that vision-language datasets do not have a set of discrete classes. To overcome this, our proposed method jointly distills image-text pairs in a contrastive formulation. Further, we leverage Low-Rank Adaptation (LoRA) matching to enable more efficient and effective trajectory matching in complex modern vision-language models. Since there are no existing baselines, we compare our distillation approach with…
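The contrastive formulation mentioned in the abstract can be pictured with a symmetric image-to-text and text-to-image loss over a batch of pairs. The sketch below uses one common form (a symmetric InfoNCE loss). The temperature value, the L2 normalization, and the function names are assumptions, and the exact loss in the paper may differ.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def bidirectional_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Row i of image_emb and row i of text_emb are treated as a matching
    pair; every other row in the batch acts as a negative.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(len(logits))
    i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return float((i2t + t2i) / 2)

# Toy check with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
images = rng.normal(size=(8, 16))
aligned_loss = bidirectional_contrastive_loss(images, images)
shuffled_loss = bidirectional_contrastive_loss(images, np.roll(images, 1, axis=0))
```

Because the loss only needs to know which image goes with which caption, it requires no class labels, which is why a pairwise objective suits datasets without discrete classes.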
Peer Reviews
Decision: Submitted to ICLR 2024
- (S1) The paper contains a good set of experiments. The authors find a way to compare their method against image-only dataset distillation methods (Table 1) which somewhat isolates the impact of the specific model proposed vs. the task of image-text dataset distillation, as opposed to image-label. Additionally, the authors also experiment by distilling only one modality (either only text or only image) (Table 4), which demonstrates the relative impact of each of the modalities and the combinati
- (W1) The distilled dataset samples shown in the qualitative results (Figure 3) are, in case of images, not very different from the original images - only augmented with some noisy high-frequency patterns, and in case of text, do not consistently appear to be better than the original captions. That raises a question of how robust those distilled datasets are and indicates that maybe the source of effectiveness of distilled datasets is somewhat different from what one would expect, that is, mode
1. This is the first paper to perform dataset distillation on a vision-language dataset.
2. Comprehensive experiments are conducted in the paper.
1. The underlying distillation process is the same as MTT, even though the expert model is trained with a bidirectional contrastive loss.
2. At the bottom of page 1, the authors mention that it is hard for text data, but in the paper they still distill in the continuous space and then simply find the closest embeddings.
- The paper is very well written and easy to understand. The authors clearly explain their method and provide an intuition behind their method selection.
- Results are significantly better than related methods with fewer examples. Ablation studies show that the multimodal distillation outperforms distillation with a single modality.
- Storing and training with the trajectory data seems like an expensive process. The addition of multimodal data requires even more resources, such as modality-specific encoders. While I believe that these factors represent significant limitations to the work, I also recognize the substantial contribution it makes to advancing this field.
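One reviewer comment above describes distilling text in a continuous embedding space and then finding the closest embeddings. A minimal sketch of that lookup step is shown here, assuming cosine similarity as the distance. The captions, the 3-dimensional embeddings, and the function name are hypothetical and only illustrate the idea.

```python
import numpy as np

def nearest_captions(distilled_emb, caption_emb, captions):
    """Map each distilled text embedding to its most similar real caption."""
    d = distilled_emb / np.linalg.norm(distilled_emb, axis=1, keepdims=True)
    c = caption_emb / np.linalg.norm(caption_emb, axis=1, keepdims=True)
    best = (d @ c.T).argmax(axis=1)  # index of highest cosine similarity
    return [captions[i] for i in best]

# Hypothetical captions with toy embeddings.
captions = ["a dog on a beach", "two people riding bikes", "a plate of food"]
caption_emb = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0]])

# Hypothetical distilled embeddings that drifted away from any real caption.
distilled_emb = np.array([[0.1, 0.9, 0.2],
                          [0.8, 0.1, 0.3]])

decoded = nearest_captions(distilled_emb, caption_emb, captions)
```

The lookup is only a way to read continuous vectors back as text; the optimization itself happens on the embeddings.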
Taxonomy
Topics: Multimodal Machine Learning Applications · Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies
