Quality over Quantity: Boosting Data Efficiency Through Ensembled Multimodal Data Curation
Jinda Xu, Yuhao Song, Daming Wang, Weiwei Zhao, Minghua Chen, Kangliang Chen, Qinya Li

TL;DR
EcoDatum is a learning-driven, multimodal data curation method that enhances data quality and efficiency, outperforming state-of-the-art techniques and significantly improving model training outcomes.
Contribution
The paper introduces EcoDatum, a novel ensemble approach to multimodal data curation that combines quality-guided deduplication with automated optimization, advancing beyond traditional heuristics.
Findings
EcoDatum outperforms existing SOTA methods on the DataComp leaderboard.
Achieves a 28% improvement over the baseline in data curation quality.
Demonstrates significant enhancement in model training efficiency.
Abstract
In an era overwhelmed by vast amounts of data, the effective curation of web-crawl datasets is essential for optimizing model performance. This paper tackles the challenges associated with the unstructured and heterogeneous nature of such datasets. Traditional heuristic curation methods often inadequately capture complex features, resulting in biases and the exclusion of relevant data. We introduce an advanced, learning-driven approach, Ensemble Curation Of DAta ThroUgh Multimodal Operators (EcoDatum), incorporating a novel quality-guided deduplication method to ensure balanced feature distributions. EcoDatum strategically integrates various unimodal and multimodal data curation operators within a weak supervision ensemble framework, utilizing automated optimization to score each data point effectively. EcoDatum, which significantly improves the data curation quality and efficiency,…
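
The two ideas named in the abstract, scoring each data point by combining several curation operators and keeping only the best item among duplicates, can be illustrated with a minimal sketch. This is not the authors' implementation: the vote encoding (+1 keep, -1 drop, 0 abstain), the fixed operator weights, the `group` key marking duplicates, and the function names `ensemble_score` and `quality_guided_dedup` are all assumptions made for illustration. In the paper the combination is optimized automatically rather than set by hand.

```python
def ensemble_score(votes, weights):
    """Combine operator votes into one score in [-1, 1].

    Assumed encoding (hypothetical): +1 = keep, -1 = drop, 0 = abstain.
    Abstaining operators are left out of the normalization.
    """
    active = sum(w for v, w in zip(votes, weights) if v != 0)
    if active == 0:
        return 0.0  # every operator abstained
    return sum(v * w for v, w in zip(votes, weights)) / active


def quality_guided_dedup(samples):
    """Keep the highest-scoring sample from each duplicate group.

    Each sample is a dict with hypothetical keys "id", "group", "score";
    "group" stands in for whatever duplicate detection produced.
    """
    best = {}
    for sample in samples:
        key = sample["group"]
        if key not in best or sample["score"] > best[key]["score"]:
            best[key] = sample
    return list(best.values())


# Illustrative data: three operators with hand-picked weights.
weights = [2.0, 1.0, 5.0]
raw = [
    {"id": 1, "group": "a", "votes": [1, 1, 0]},
    {"id": 2, "group": "a", "votes": [1, -1, 0]},
    {"id": 3, "group": "b", "votes": [-1, 0, 1]},
]
for item in raw:
    item["score"] = ensemble_score(item["votes"], weights)

kept = quality_guided_dedup(raw)
kept_ids = sorted(item["id"] for item in kept)
```

Here samples 1 and 2 are treated as duplicates, and only the one with the higher ensemble score survives; a real pipeline would then filter the survivors by a score threshold.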
Taxonomy
Topics: Semantic Web and Ontologies · Data Quality and Management
