NeuCLIP: Efficient Large-Scale CLIP Training with Neural Normalizer Optimization
Xiyuan Wei, Chih-Jen Lin, Tianbao Yang

TL;DR
NeuCLIP introduces a novel neural normalizer optimization framework for large-scale CLIP training, significantly improving the estimation of normalization terms and enhancing model performance on massive datasets.
Contribution
It reformulates the contrastive loss into a minimization problem with an auxiliary variable and transforms it into a neural network prediction task, enabling more accurate normalization in large-scale CLIP training.
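The reformulation can be illustrated with a standard identity from convex analysis. The notation below ($Z_i$, $s_{ij}$, $\tau$, $\alpha$) is assumed for illustration and may differ from the paper's exact formulation.

```latex
% Assumed notation: s_{ij} is the similarity between anchor i and candidate j,
% \tau > 0 is a temperature, and Z_i is the normalizer of sample i.
\begin{align*}
Z_i &= \frac{1}{n} \sum_{j=1}^{n} \exp\!\left(s_{ij}/\tau\right), \\
% Variational form of the logarithm, valid for any Z_i > 0:
\log Z_i &= \min_{\alpha \in \mathbb{R}} \left\{ e^{-\alpha} Z_i + \alpha - 1 \right\}, \\
% First-order condition: -e^{-\alpha} Z_i + 1 = 0, hence
\alpha_i^{\star} &= \log Z_i .
\end{align*}
```

Because the objective inside the minimum is linear in $Z_i$, replacing $Z_i$ with a mini-batch average gives an unbiased estimate of that objective for a fixed $\alpha$. Taking the logarithm of a mini-batch average directly does not have this property.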
Findings
Outperforms previous normalizer estimation methods on datasets from millions to billions of samples.
Achieves more accurate normalization leading to improved CLIP model performance.
Demonstrates scalability and efficiency in large-scale contrastive learning.
Abstract
Accurately estimating the normalization term (also known as the partition function) in the contrastive loss is a central challenge for training Contrastive Language-Image Pre-training (CLIP) models. Conventional methods rely on large batches for approximation, demanding substantial computational resources. To mitigate this issue, prior works introduced per-sample normalizer estimators, which are updated at each epoch in a blockwise coordinate manner to keep track of updated encoders. However, this scheme incurs optimization error that scales with the ratio of dataset size to batch size, limiting effectiveness for large datasets or small batches. To overcome this limitation, we propose NeuCLIP, a novel and elegant optimization framework based on two key ideas: (i) reformulating the contrastive loss for each sample into a minimization problem with…
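To make the estimation problem concrete, the following is a minimal sketch on synthetic data, not the paper's algorithm. It compares the log of a mini-batch average with an auxiliary variable trained by SGD on the objective `exp(-alpha) * Z + alpha`. The Gaussian scores, batch size, step size, and all variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: similarity scores of one anchor against n candidates.
n, batch_size, tau = 10_000, 8, 1.0
scores = rng.normal(size=n)
weights = np.exp(scores / tau)
true_log_z = np.log(weights.mean())  # log-normalizer over the full dataset

# (a) Naive mini-batch estimate: log of a batch average.
#     By Jensen's inequality it underestimates log Z on average.
batches = rng.choice(weights, size=(2_000, batch_size))
naive_log_z = np.log(batches.mean(axis=1)).mean()

# (b) Auxiliary-variable estimate: run SGD on
#     f(alpha) = exp(-alpha) * Z + alpha, whose minimizer is alpha = log Z.
#     The stochastic gradient 1 - exp(-alpha) * Z_batch is unbiased
#     because f is linear in Z.
steps, lr = 20_000, 0.01
z_batches = rng.choice(weights, size=(steps, batch_size)).mean(axis=1)
alpha, tail = 0.0, []
for t, z_b in enumerate(z_batches):
    alpha -= lr * (1.0 - np.exp(-alpha) * z_b)
    if t >= steps // 2:
        tail.append(alpha)  # average the second half to reduce SGD noise
alpha_avg = float(np.mean(tail))

print(f"true log Z      : {true_log_z:.4f}")
print(f"naive mini-batch: {naive_log_z:.4f}")
print(f"auxiliary alpha : {alpha_avg:.4f}")
```

In this toy setting the naive estimate stays biased no matter how many small batches are averaged, while the auxiliary variable approaches the full-data log-normalizer using the same batch size. A single scalar `alpha` is tracked here; tracking one per sample, or predicting it with a network as the paper proposes, is outside the scope of this sketch.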
Peer Reviews
Decision: ICLR 2026 Poster
- The paper rests on an elegant theoretical foundation that combines the convex conjugate with variational principles to remove per-sample normalizer tracking.
- Strong empirical validation: NeuCLIP consistently outperforms baselines on multiple datasets and scales favorably to billion-sample training.

- Some hyperparameters (restart frequency, number of prototypes m) seem tuned per dataset; robustness to such choices is not discussed.
- It would be worth discussing the computational overhead of the extra NPN updates, restarts, and parameter synchronization, to give practitioners more insight for real-world use.
1. The convex-variational reformulation removes the reciprocal-of-estimator bias from mini-batch CLIP and avoids per-sample moving averages in FastCLIP. The unified loss couples encoder and NPN training without needing a separate consistency target for the normalizer.
2. Alternating updates with multiple quick NPN steps and periodic NPN restarts are easy to implement and, per their appendix, give better stability than simultaneous updates.
1. **Limited accounting of compute and wall-clock.** Results are reported “under the same budget” and with “8 × H100,” but the paper does not provide thorough wall-clock and energy numbers for NeuCLIP vs. strong baselines at equal accuracy.
2. **Breadth of baselines and settings.** SigLIP is included, but the study would benefit from (i) larger-batch SigLIP/OpenCLIP points at matched compute, and (ii) comparisons under stronger data filtering or with modern data recipes, since normalizer accura
1. The loss is reformulated via a Fenchel conjugate into a per-sample minimization whose optimizer equals the log-normalizer; a variational theorem is then invoked to justify searching over functions and learning an NPN. The paper provides the derivations and the induced FastCLIP update as a special case.
2. Results span multiple datasets (millions→billions of pairs), report the Datacomp average plus subset scores, include ablations of hyperparameters (e.g., restart frequency and update count
1. The paper notes dataset download discrepancies vs. AmorLIP (different numbers of successfully fetched samples) yet still compares scores; large-scale web datasets are sensitive to crawl state, which can materially shift results. The largest-scale runs report single numbers without variability, and some baselines rely on third-party code with possible configuration drift; together, these reduce the strength of “method X > method Y” claims.
2. The estimation-error plots rely on a “true normalizer”
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
