Concept-Aware Batch Sampling Improves Language-Image Pretraining

Adhiraj Ghosh; Vishaal Udandarao; Thao Nguyen; Matteo Farina; Mehdi Cherti; Jenia Jitsev; Sewoong Oh; Elisa Ricci; Ludwig Schmidt; Matthias Bethge

arXiv:2511.20643·cs.CV·November 26, 2025

TL;DR

This paper introduces Concept-Aware Batch Sampling (CABS), a flexible, online, concept-based data curation method that improves language-image pretraining by constructing batches on the fly to match specific target concept distributions.

Contribution

The paper presents DataConcept, a collection of 128M web-crawled image-text pairs annotated with their concept composition, and CABS, a batch sampling framework that improves vision-language model training through task-adaptive concept curation.

Findings

1. CABS improves model performance across 28 benchmarks.
2. CABS enables flexible, task-specific concept distribution control.
3. CABS outperforms offline, concept-agnostic data curation methods.

Abstract

What data should a vision-language model be trained on? To answer this question, many data curation efforts center on the quality of a dataset. However, most of these existing methods are (i) offline, i.e. they produce a static dataset from a set of predetermined filtering criteria, and (ii) concept-agnostic, i.e. they use model-based filters which induce additional data biases. In this work, we go beyond such offline, concept-agnostic methods and advocate for more flexible, task-adaptive online concept-based curation. Our first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, we introduce Concept-Aware Batch Sampling (CABS), a simple yet effective batch sampling framework that flexibly constructs batches on-the-fly based on specific target distributions. We…
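The abstract describes building batches on the fly from per-sample concept annotations. The sketch below illustrates that general idea only; it is not the authors' algorithm. The pool format (a list of `(sample_id, set_of_concepts)` pairs), the function names `balanced_batch` and `most_concepts_batch`, and the inverse-frequency scoring rule are all assumptions made for illustration. The first function greedily favours concepts that are still rare in the batch (in the spirit of the "concept-balanced batches" mentioned in the peer reviews), and the second simply prefers samples annotated with more concepts.

```python
import random
from collections import Counter


def balanced_batch(pool, batch_size, seed=0):
    """Greedily pick samples whose concepts are still rare in the batch.

    pool: list of (sample_id, set_of_concepts) pairs (hypothetical format).
    Returns the chosen sample ids and the per-concept counts in the batch.
    """
    rng = random.Random(seed)
    remaining = list(pool)
    rng.shuffle(remaining)  # random order so ties are not broken by position
    counts = Counter()      # how often each concept appears in the batch so far
    batch = []

    def score(item):
        # Assumed scoring rule: mean inverse frequency of the sample's concepts.
        concepts = item[1]
        if not concepts:
            return 0.0
        return sum(1.0 / (1 + counts[c]) for c in concepts) / len(concepts)

    while remaining and len(batch) < batch_size:
        best = max(remaining, key=score)
        remaining.remove(best)
        batch.append(best[0])
        counts.update(best[1])
    return batch, counts


def most_concepts_batch(pool, batch_size):
    """Alternative heuristic: keep the samples annotated with the most concepts."""
    ranked = sorted(pool, key=lambda item: len(item[1]), reverse=True)
    return [sample_id for sample_id, _ in ranked[:batch_size]]


if __name__ == "__main__":
    # Toy pool dominated by one concept; ids and concepts are made up.
    pool = [(f"dog{i}", {"dog"}) for i in range(8)]
    pool += [("cat0", {"cat"}), ("cat1", {"cat"}), ("car0", {"car"}), ("car1", {"car"})]
    batch, counts = balanced_batch(pool, batch_size=4)
    print(batch, dict(counts))
```

The greedy loop rescans the remaining pool for every pick, so it costs on the order of pool size times batch size; that is acceptable for a toy pool but would need a more efficient scheme at pretraining scale.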

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 5

Strengths

- A new data curation mechanism that adjusts, online, the data distribution seen by the model to improve training effectiveness.
- The motivation to move beyond existing random sampling is appreciated.
- The evaluation is comprehensive.

Weaknesses

- The definition of a concept is critical to the method's development, so a discussion of why the current concept bank definition is optimal is needed. As the authors mention MetaCLIP many times, I am curious how the concept bank differs from the metadata used in MetaCLIP (in MetaCLIP, a balanced distribution according to metadata is one critical standard for dataset construction).
- Following on from this, when the concept bank contains erroneous or missed concepts, how your method can robu

Reviewer 02 · Rating 2 · Confidence 5

Strengths

- Achieves better zero-shot performance than random sampling. Another heuristic variant that samples examples with more concepts yields better retrieval performance.
- Provides empirical validation on benchmarks.

Weaknesses

I don't see the novelty or new insights provided by this paper. The idea that balanced mini-batches improve convergence and performance on smaller groups of data is well known. It has been used before in many different domains, including federated learning, data selection, etc. From an optimization perspective, the reason is that balanced mini-batches have smaller gradient variance, which yields faster convergence, as theoretically analyzed and shown by several existing papers in the

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The constructed concept-aware pretraining dataset with rewritten context-aware captions seems promising and useful.
2. The experiment in Table 1 shows a clear advantage of concept-aware re-captions in LIP.

Weaknesses

1. While the curated concept-aware pretraining dataset is meaningful to the community, the proposed CABS does not seem to work well with the dataset. Unlike other sampling strategies that may achieve consistent improvements across classification and retrieval tasks, CABS may perform well on classification but worse on retrieval tasks. Though the authors propose an alternative, CABS-FM, to improve performance on retrieval tasks, this complicates the overall design since a hard choice need

Reviewer 04 · Rating 4 · Confidence 5

Strengths

The paper shows that (1) relabeling captions to emphasize image concepts (DataConcept) and (2) pretraining with concept-balanced batches (CABS-DM) consistently improve zero-shot classification across diverse settings. Zero-shot retrieval accuracy also increases with the optimal batching strategy (CABS-FM). These conclusions are supported by extensive experiments.

Weaknesses

- The method performs well on zero-shot classification but degrades zero-shot retrieval accuracy under the default CABS-DM. Although the CABS-FM variant improves retrieval, relying on different setups for different applications weakens the contribution, as a single pretrained model is generally expected to work across tasks. If the model underperforms on either classification or retrieval, it may also struggle on downstream tasks such as detection and segmentation, which undermines the promise o

Code & Models

Datasets

bethgelab/dataconcept_128M
dataset · 14k downloads

Taxonomy

Topics: Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques