CAST: Collapse-Aware multi-Scale Topology Fusion for Multimodal Coreset Selection
Boran Zhao, Hetian Liu, Zhenxian Hu, Yuqing Yuan, Yu Yan, Pengju Ren

TL;DR
CAST is a framework for multimodal dataset selection that picks a compact, informative coreset of image-text pairs by capturing cross-modal topology and multi-scale semantic structure, so that large models can be trained at lower computational cost.
Contribution
It introduces a collapse-aware multi-scale topology fusion method for better coreset selection in multimodal datasets, addressing limitations of existing single-modality and coarse-grained approaches.
Findings
CAST outperforms existing dataset selection methods on Flickr30K and MS-COCO.
It demonstrates superior cross-architecture generalization.
It achieves better energy efficiency in multimodal training.
Abstract
The training of large multimodal models fundamentally relies on massive image-text datasets, which inevitably incur prohibitive computational overhead. Dataset selection offers a promising paradigm by identifying a highly informative coreset. However, existing approaches suffer from two critical limitations: (i) single-modality-dominated sampling methods, which ignore the fine-grained cross-modal information imbalance inherent in multimodal datasets and thus lead to semantic loss in the other modality; and (ii) coarse-grained sample-scoring-based sampling methods, where the selected coreset tends to be biased toward the scoring model, making it difficult to guarantee distributional equivalence between the coreset and the original dataset. Meanwhile, existing distribution matching and discrete sampling strategies often fail to jointly account for global semantic structure, local…
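To make the idea of distributional equivalence between a coreset and the original dataset concrete, here is a minimal sketch of a generic mean-matching baseline. It is not the CAST algorithm, whose details are not given in this summary. The toy `pairs` data, the helper names (`mean_vec`, `mean_gap`, `greedy_mean_matching`), and the choice of a linear-kernel MMD (the distance between mean embeddings) are all assumptions made for illustration. Matching means only checks the first moment of the distribution, so it is a weak proxy for the structure-aware matching the abstract calls for.

```python
import math

def mean_vec(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def mean_gap(full, subset):
    """Distance between mean embeddings (MMD with a linear kernel)."""
    return math.dist(mean_vec(full), mean_vec(subset))

def greedy_mean_matching(full, k):
    """Greedily pick k indices so the running subset mean tracks the full mean.

    Generic baseline for illustration only; this is NOT the CAST algorithm.
    """
    target = mean_vec(full)
    dim = len(target)
    chosen = []
    running_sum = [0.0] * dim
    remaining = list(range(len(full)))
    for step in range(1, k + 1):
        def gap_if_added(i):
            # Gap between the full mean and the subset mean if sample i is added.
            candidate = [(running_sum[d] + full[i][d]) / step for d in range(dim)]
            return math.dist(target, candidate)
        best = min(remaining, key=gap_if_added)
        chosen.append(best)
        remaining.remove(best)
        running_sum = [running_sum[d] + full[best][d] for d in range(dim)]
    return chosen

# Hypothetical joint embeddings: [image_feature, text_feature] per pair.
pairs = [[float(i), float(i % 3)] for i in range(20)]
coreset_idx = greedy_mean_matching(pairs, k=4)
coreset = [pairs[i] for i in coreset_idx]
prefix = pairs[:4]  # naive "first k" subset for comparison
print(mean_gap(pairs, coreset), mean_gap(pairs, prefix))
```

On this toy data the greedy subset ends up with a much smaller mean gap than the naive prefix, which is the kind of distributional check a selection method can be judged by. A method that scores only one modality would correspond to computing the gap on a single coordinate and ignoring the other.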
