SCAR: A Characterization Scheme for Multi-Modal Dataset
Ri Su, Zhao Chen, Caleb Chen Cao, Nan Tang, Lei Chen

TL;DR
SCAR is a novel framework for characterizing datasets' intrinsic structural properties across scale, coverage, authenticity, and richness, enabling better understanding and efficient augmentation of multi-modal datasets for foundation models.
Contribution
Introduces SCAR, a dataset characterization scheme whose measures remain invariant under dataset scaling and which guides data augmentation for improved model generalization.
Findings
SCAR effectively predicts data utility across diverse datasets.
Foundation data can preserve model generalization without retraining.
SCAR-guided data expansion improves multimodal dataset quality.
Abstract
Foundation models exhibit remarkable generalization across diverse tasks, largely driven by the characteristics of their training data. Recent data-centric methods such as pruning and compression aim to optimize training but offer limited theoretical insight into how data properties affect generalization, particularly how data characteristics behave as the number of samples scales. Traditional perspectives further constrain progress by focusing predominantly on data quantity and training efficiency, often overlooking structural aspects of data quality. In this study, we introduce SCAR, a principled scheme for characterizing the intrinsic structural properties of datasets across four key measures: Scale, Coverage, Authenticity, and Richness. Unlike prior data-centric measures, SCAR captures stable characteristics that remain invariant under dataset scaling, providing a robust and general foundation for data…
