Coreset Selection via LLM-based Concept Bottlenecks
Akshay Mehra, Trisha Mittal, Subhadra Gopalakrishnan, Joshua Kimball

TL;DR
This paper introduces a coreset selection method that uses concept bottlenecks derived from large language models to identify representative data subsets without first training the downstream model on the full dataset, making selection cheaper to compute and easier to interpret.
Contribution
It proposes a new difficulty score based on human-understandable concepts derived from LLMs, enabling efficient, model-agnostic coreset selection without training the downstream model on the full dataset.
Findings
Outperforms random subsets at high pruning rates
Achieves comparable or better performance than training dynamics methods
Works effectively on unlabeled datasets
Abstract
Coreset Selection (CS) aims to identify a subset of the training dataset that achieves model performance comparable to using the entire dataset. Many state-of-the-art CS methods select coresets using scores whose computation requires training the downstream model on the entire dataset first and recording changes in the model's behavior on samples as it trains (training dynamics). These scores are inefficient to compute and hard to interpret, as they do not indicate whether a sample is difficult to learn in general or only for a specific downstream model. Our work addresses these challenges by proposing a score that computes a sample's difficulty using human-understandable textual attributes (concepts) independent of any downstream model. Specifically, we measure the alignment between a sample's visual features and concept bottlenecks, derived via large language models, by training a…
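The abstract excerpt above is cut off before it finishes describing the score, so the following is only a minimal sketch of the general idea and not the authors' implementation. It assumes image and concept embeddings are already available as NumPy arrays (for example from a vision-language encoder), treats cosine similarity as the concept activation, trains a small softmax layer on those activations, and uses one minus the true-class probability as a stand-in difficulty score. The function names, the score definition, and the evenly spaced selection rule are all illustrative assumptions.

```python
import numpy as np

def concept_activations(image_feats, concept_feats):
    """Cosine similarity between every image embedding and every concept embedding."""
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    con = concept_feats / np.linalg.norm(concept_feats, axis=1, keepdims=True)
    return img @ con.T  # shape: (n_samples, n_concepts)

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_linear_head(acts, labels, n_classes, epochs=200, lr=0.5):
    """Softmax regression on concept activations via plain gradient descent."""
    n, k = acts.shape
    W = np.zeros((k, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        grad = (softmax(acts @ W + b) - onehot) / n  # cross-entropy gradient
        W -= lr * acts.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def difficulty_scores(acts, labels, W, b):
    """Assumed score: 1 - predicted probability of the true class."""
    probs = softmax(acts @ W + b)
    return 1.0 - probs[np.arange(len(labels)), labels]

def select_coreset(scores, keep_fraction):
    """Assumed rule: keep samples evenly spaced across the difficulty ranking."""
    n_keep = max(1, int(round(keep_fraction * len(scores))))
    order = np.argsort(scores)  # easiest to hardest
    picks = np.linspace(0, len(scores) - 1, n_keep).astype(int)
    return order[picks]

# Synthetic stand-ins for real embeddings and labels (hypothetical data).
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(60, 8))
concept_feats = rng.normal(size=(5, 8))
acts = concept_activations(image_feats, concept_feats)
labels = acts[:, :3].argmax(axis=1)  # toy labels tied to the first 3 concepts
W, b = train_linear_head(acts, labels, n_classes=3)
scores = difficulty_scores(acts, labels, W, b)
coreset = select_coreset(scores, keep_fraction=0.25)
```

Only the small linear layer is trained here, never a downstream model, which is the efficiency argument the abstract makes. This sketch relies on labels; the unlabeled setting mentioned in the findings would need a substitute such as pseudo-labels, which is not shown.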
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Neural Network Applications
Methods: Coresets · Pruning
