A Dataset is Worth 1 MB
Elad Kimchi Shoshani, Leeyam Gabay, Yedid Hoshen

TL;DR
This paper introduces PLADA, a method that drastically reduces dataset transmission costs by transmitting only class labels for selected images from a large reference dataset, enabling efficient task transfer with minimal payload.
Contribution
The paper proposes a label-based data transmission method that eliminates pixel transfer and addresses the distribution mismatch between the reference and target datasets by pruning the reference dataset.
Findings
Achieves task transfer with a payload of less than 1 MB.
Maintains high classification accuracy across diverse datasets.
Effectively filters reference datasets to relevant images for target tasks.
Abstract
A dataset server must often distribute the same large payload to many clients, incurring massive communication costs. Since clients frequently operate on diverse hardware and software frameworks, transmitting a pre-trained model is often infeasible; instead, agents require raw data to train their own task-specific models locally. While dataset distillation attempts to compress training signals, current methods struggle to scale to high-resolution data and rarely achieve sufficiently small files. In this paper, we propose Pseudo-Labels as Data (PLADA), a method that completely eliminates pixel transmission. We assume agents are preloaded with a large, generic, unlabeled reference dataset (e.g., ImageNet-1K, ImageNet-21K) and communicate a new task by transmitting only the class labels for specific images. To address the distribution mismatch between the reference and target datasets, we…
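The label-only idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' encoding: the function names (`pack_codes`, `unpack_codes`, `encode_labels`, `decode_labels`), the use of code 0 to mark a pruned reference image, and the zlib compression step are all assumptions made for illustration. The sketch assumes the server and client share the same ordering of the reference dataset, so one small integer per reference image is enough to describe the task.

```python
import zlib


def pack_codes(codes, bits):
    """Pack non-negative integers into bytes using `bits` bits each (MSB first)."""
    out = bytearray()
    acc = 0   # bit buffer
    nacc = 0  # number of valid bits currently in the buffer
    for c in codes:
        acc = (acc << bits) | c
        nacc += bits
        while nacc >= 8:
            nacc -= 8
            out.append((acc >> nacc) & 0xFF)
        acc &= (1 << nacc) - 1  # keep only the bits not yet emitted
    if nacc:
        out.append((acc << (8 - nacc)) & 0xFF)  # zero-pad the last byte
    return bytes(out)


def unpack_codes(data, count, bits):
    """Inverse of pack_codes: read `count` integers of `bits` bits each."""
    codes = []
    acc = 0
    nacc = 0
    it = iter(data)
    for _ in range(count):
        while nacc < bits:
            acc = (acc << 8) | next(it)
            nacc += 8
        nacc -= bits
        codes.append((acc >> nacc) & ((1 << bits) - 1))
        acc &= (1 << nacc) - 1
    return codes


def encode_labels(labels, num_classes):
    """Hypothetical payload format: one code per reference image.

    labels[i] is the class assigned to reference image i, or None if the
    image was pruned. Code 0 means "pruned"; class k is stored as k + 1.
    """
    bits = num_classes.bit_length()  # enough bits for codes 0..num_classes
    codes = [0 if lab is None else lab + 1 for lab in labels]
    return zlib.compress(pack_codes(codes, bits), 9)


def decode_labels(payload, num_images, num_classes):
    """Client side: recover the per-image labels from the payload."""
    bits = num_classes.bit_length()
    codes = unpack_codes(zlib.decompress(payload), num_images, bits)
    return [None if c == 0 else c - 1 for c in codes]


# Toy example: 8 reference images, 3 target classes, half of them pruned.
labels = [None, 0, 2, None, 1, None, 2, None]
payload = encode_labels(labels, num_classes=3)
restored = decode_labels(payload, num_images=len(labels), num_classes=3)
```

As a back-of-envelope check under these assumptions (not the paper's own accounting), a 10-class task needs 4 bits per code, so a reference set of roughly 1.28 million images would occupy about 0.64 MB before compression. A mostly pruned label vector contains long runs of zeros, which a general-purpose compressor can shrink further.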
Taxonomy
Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning
