SCAR: A Characterization Scheme for Multi-Modal Dataset

Ri Su; Zhao Chen; Caleb Chen Cao; Nan Tang; Lei Chen

arXiv:2508.19659 · cs.LG · August 28, 2025

TL;DR

SCAR is a novel framework for characterizing datasets' intrinsic structural properties across scale, coverage, authenticity, and richness, enabling better understanding and efficient augmentation of multi-modal datasets for foundation models.

Contribution

Introduces SCAR, a scheme for dataset characterization whose measures remain invariant under dataset scaling and which guides data augmentation for improved model generalization.

Findings

01. SCAR effectively predicts data utility across diverse datasets.

02. Foundation data can preserve model generalization without retraining.

03. SCAR-guided data expansion improves multimodal dataset quality.

Abstract

Foundation models exhibit remarkable generalization across diverse tasks, largely driven by the characteristics of their training data. Recent data-centric methods like pruning and compression aim to optimize training but offer limited theoretical insight into how data properties affect generalization, especially how data characteristics behave under sample scaling. Traditional perspectives further constrain progress by focusing predominantly on data quantity and training efficiency, often overlooking structural aspects of data quality. In this study, we introduce SCAR, a principled scheme for characterizing the intrinsic structural properties of datasets across four key measures: Scale, Coverage, Authenticity, and Richness. Unlike prior data-centric measures, SCAR captures stable characteristics that remain invariant under dataset scaling, providing a robust and general foundation for data…
