Vision-Language Dataset Distillation

Xindi Wu; Byron Zhang; Zhiwei Deng; Olga Russakovsky

arXiv:2308.07545 · cs.CV · August 21, 2024

TL;DR

This paper introduces the first vision-language dataset distillation method that creates small, synthetic datasets from large-scale datasets, significantly improving retrieval performance with fewer training pairs.

Contribution

It pioneers a vision-language dataset distillation approach using trajectory matching and LoRA, addressing the lack of discrete classes in such datasets.
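As a rough illustration of the trajectory-matching idea mentioned above, the sketch below computes the normalized parameter distance used in the original MTT formulation for image classification. It is an assumption that the vision-language objective has the same shape; the function name and the flat-list representation of weights are hypothetical, and with LoRA the matched parameters would presumably be the low-rank adapter weights rather than the full model.

```python
def trajectory_matching_loss(student_end, expert_start, expert_target):
    """Normalized squared distance between student and expert parameters.

    All arguments are flat lists of floats standing in for flattened
    model weights (a hypothetical representation for illustration).
    student_end   : student weights after training on the synthetic data
    expert_start  : expert checkpoint the student was initialized from
    expert_target : later expert checkpoint the student should reach
    """
    # Squared distance from the student's endpoint to the expert target.
    num = sum((s - t) ** 2 for s, t in zip(student_end, expert_target))
    # Squared distance the expert itself travelled; normalizes the loss.
    den = sum((a - t) ** 2 for a, t in zip(expert_start, expert_target))
    return num / den


expert_start = [0.0, 0.0]
expert_target = [1.0, 1.0]
student_end = [1.0, 0.0]  # the student matched only the first coordinate
loss = trajectory_matching_loss(student_end, expert_start, expert_target)
```

In a real pipeline this value would be differentiated with respect to the synthetic image-text pairs, which requires an autodiff framework and is not shown here.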

Findings

1. Significant performance improvements on Flickr30K and COCO benchmarks.
2. Distillation with 100 pairs nearly doubles retrieval accuracy compared to coreset selection.
3. Method reduces training data size by an order of magnitude while maintaining high performance.

Abstract

Dataset distillation methods reduce large-scale datasets to smaller sets of synthetic data, preserving sufficient information to quickly train a new model from scratch. However, prior work on dataset distillation has focused exclusively on image classification datasets, whereas modern large-scale datasets are primarily vision-language datasets. In this work, we design the first vision-language dataset distillation method, building on the idea of trajectory matching. A key challenge is that vision-language datasets do not have a set of discrete classes. To overcome this, our proposed method jointly distills image-text pairs in a contrastive formulation. Further, we leverage Low-Rank Adaptation (LoRA) matching to enable more efficient and effective trajectory matching in complex modern vision-language models. Since there are no existing baselines, we compare our distillation approach with…
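To make the contrastive formulation concrete, here is a minimal sketch of a standard bidirectional (image-to-text and text-to-image) InfoNCE-style loss over a batch of paired embeddings. It is not taken from the paper: the function names, the temperature value of 0.07, and the assumption that embeddings are already L2-normalized are all illustrative choices.

```python
import math


def _mean_diag_nll(sim):
    """Mean negative log-softmax of the diagonal entry of each row."""
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[i]
    return total / len(sim)


def bidirectional_contrastive_loss(img, txt, temperature=0.07):
    """Average of image-to-text and text-to-image contrastive losses.

    img, txt: lists of equal length; img[i] and txt[i] form a pair.
    Embeddings are assumed to be L2-normalized (an assumption here).
    """
    # sim[i][j] is the scaled dot product between image i and text j.
    sim = [[sum(a * b for a, b in zip(u, v)) / temperature for v in txt]
           for u in img]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-image direction
    return 0.5 * (_mean_diag_nll(sim) + _mean_diag_nll(sim_t))


images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = bidirectional_contrastive_loss(images, matched_texts)
loss_swapped = bidirectional_contrastive_loss(images, swapped_texts)
```

Because the loss only needs paired embeddings, not class labels, it applies to image-text data that has no discrete classes; correctly paired embeddings yield a lower loss than mismatched ones.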

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- (S1) The paper contains a good set of experiments. The authors find a way to compare their method against image-only dataset distillation methods (Table 1) which somewhat isolates the impact of the specific model proposed vs. the task of image-text dataset distillation, as opposed to image-label. Additionally, the authors also experiment by distilling only one modality (either only text or only image) (Table 4), which demonstrates the relative impact of each of the modalities and the combinati

Weaknesses

- (W1) The distilled dataset samples shown in the qualitative results (Figure 3) are, in case of images, not very different from the original images - only augmented with some noisy high-frequency patterns, and in case of text, do not consistently appear to be better than the original captions. That raises a question of how robust those distilled datasets are and indicates that maybe the source of effectiveness of distilled datasets is somewhat different from what one would expect, that is, mode

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 5

Strengths

1. This is the first paper to perform dataset distillation on a vision-language dataset.
2. Comprehensive experiments are conducted in the paper.

Weaknesses

1. The underlying distillation process is the same as in MTT, even though the expert model is trained with a bidirectional contrastive loss.
2. At the bottom of page 1, the authors say that distillation is hard for text data, but in the paper they still distill in the continuous space and then simply find the closest embeddings.

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 2

Strengths

- The paper is very well written and easy to understand. The authors clearly explain their method and provide an intuition behind their method selection.
- Results are significantly better than related methods with fewer examples. Ablation studies show that the multimodal distillation outperforms distillation with a single modality.

Weaknesses

- Storing and training with the trajectory data seems like an expensive process. The addition of multimodal data requires even more resources, such as modality-specific encoders. While I believe that these factors represent significant limitations to the work, I also recognize the substantial contribution it makes to advancing this field.

Taxonomy

Topics: Multimodal Machine Learning Applications · Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies