Leveraging Multi-Modal Information to Enhance Dataset Distillation
Zhe Li, Hadrien Reynaud, Bernhard Kainz

TL;DR
This paper introduces a multi-modal dataset distillation method that combines visual and textual information with object-centric masking to produce compact, privacy-preserving synthetic datasets with improved utility.
Contribution
It proposes a novel multi-modal framework with caption-guided supervision and object-centric masking, enhancing dataset distillation beyond visual-only approaches.
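As a rough illustration of what object-centric masking could look like, the sketch below zeroes out every pixel outside a binary object mask. This summary does not say how the paper obtains or applies its masks, so the function name `apply_object_mask`, the nested-list image layout, and the `fill` value are all assumptions made for this example and not the authors' implementation.

```python
def apply_object_mask(image, mask, fill=0.0):
    """Keep pixels where mask is truthy; replace the rest with `fill`.

    `image` and `mask` are 2-D nested lists of the same shape.
    This is a hypothetical helper, not the paper's actual procedure.
    """
    if len(image) != len(mask) or any(
        len(row) != len(mrow) for row, mrow in zip(image, mask)
    ):
        raise ValueError("image and mask must have the same shape")
    return [
        [px if keep else fill for px, keep in zip(row, mrow)]
        for row, mrow in zip(image, mask)
    ]


if __name__ == "__main__":
    toy_image = [[0.2, 0.9], [0.7, 0.1]]
    toy_mask = [[1, 0], [0, 1]]  # 1 marks pixels assumed to belong to the object
    print(apply_object_mask(toy_image, toy_mask))
```

In practice the mask would come from some segmentation or localization step; which one the paper uses is not stated in this summary.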
Findings
Improves downstream task performance of models trained on the distilled data.
Enhances privacy by removing the need to store or share sensitive real images.
Produces synthetic data that focuses more closely on the objects of interest.
Abstract
Dataset distillation aims to create a small and highly representative synthetic dataset that preserves the essential information of a larger real dataset. Beyond reducing storage and computational costs, related approaches offer a promising avenue for privacy preservation in computer vision by eliminating the need to store or share sensitive real-world images. Existing methods focus solely on optimizing visual representations, overlooking the potential of multi-modal information. In this work, we propose a multi-modal dataset distillation framework that incorporates two key enhancements: caption-guided supervision and object-centric masking. To leverage textual information, we introduce two strategies: caption concatenation, which fuses caption embeddings with visual features during classification, and caption matching, which enforces semantic alignment between real and synthetic data…
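The two caption strategies named in the abstract can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes caption embeddings and visual features are plain float vectors, that concatenation is a simple vector join, and that "semantic alignment" is measured as cosine distance between paired real and synthetic caption embeddings. The function names (`concat_features`, `caption_matching_loss`) and the pairing scheme are hypothetical; the paper may use a different fusion or loss.

```python
import math


def concat_features(visual, caption):
    """Caption concatenation (assumed form): join the visual feature vector
    and the caption embedding into one input for the classifier."""
    return list(visual) + list(caption)


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def caption_matching_loss(real_caps, syn_caps):
    """Caption matching (assumed form): mean cosine distance between paired
    real and synthetic caption embeddings. Lower means better aligned."""
    if len(real_caps) != len(syn_caps) or not real_caps:
        raise ValueError("need the same non-zero number of real and synthetic embeddings")
    distances = [1.0 - cosine_similarity(r, s) for r, s in zip(real_caps, syn_caps)]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    fused = concat_features([0.5, 0.1], [0.3, 0.3, 0.4])
    print(len(fused))  # length of visual features plus caption embedding
    print(caption_matching_loss([[1.0, 0.0]], [[0.0, 1.0]]))
```

In a distillation loop, a term like `caption_matching_loss` would presumably be added to the visual objective so that synthetic samples stay semantically close to the real data; how the paper weights or combines the terms is not given in this excerpt.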
Taxonomy
Topics: Data Stream Mining Techniques · Machine Learning and Data Classification · Neural Networks and Applications
Methods: Focus
