CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment
Shengzhu Yang, Jiawei Du, Shuai Lu, Weihang Zhang, Ningli Wang, Huiqi Li

TL;DR
CLIPin is a non-contrastive plug-in designed to enhance multimodal semantic alignment in CLIP architectures, improving robustness and generalization across diverse datasets and tasks.
Contribution
It introduces a novel non-contrastive module that seamlessly integrates with CLIP, combining contrastive and non-contrastive learning for better semantic alignment.
Findings
Improves robustness of multimodal representations.
Enhances alignment on diverse downstream tasks.
Compatible with various contrastive frameworks.
Abstract
Large-scale natural image-text datasets, especially those automatically collected from the web, often suffer from loose semantic alignment due to weak supervision, while medical datasets tend to have high cross-modal correlation but low content diversity. These properties pose a common challenge for contrastive language-image pretraining (CLIP): they hinder the model's ability to learn robust and generalizable representations. In this work, we propose CLIPin, a unified non-contrastive plug-in that can be seamlessly integrated into CLIP-style architectures to improve multimodal semantic alignment, providing stronger supervision and enhancing alignment robustness. Furthermore, two shared pre-projectors are designed for image and text modalities respectively to facilitate the integration of contrastive and non-contrastive learning in a parameter-compromise manner. Extensive experiments on…
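The abstract describes adding non-contrastive supervision alongside CLIP's contrastive objective, but this summary does not give the actual losses. The following is a minimal sketch of the general idea only, not the authors' formulation: it assumes a symmetric InfoNCE term plus a simple negatives-free cosine alignment term on matched pairs. The function names (`info_nce`, `non_contrastive`, `combined_loss`), the `weight` parameter, and the temperature value are all hypothetical, and the shared pre-projectors and any online/target branch design are omitted.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(img, txt, temperature=0.07):
    # Symmetric InfoNCE: pair i is the positive, every other item in the batch is a negative.
    img = [l2_normalize(v) for v in img]
    txt = [l2_normalize(v) for v in txt]
    n = len(img)
    logits = [[dot(img[i], txt[j]) / temperature for j in range(n)] for i in range(n)]

    def cross_entropy_rows(m):
        # Mean cross-entropy with the diagonal as the target class (log-sum-exp for stability).
        total = 0.0
        for i, row in enumerate(m):
            mx = max(row)
            log_z = mx + math.log(sum(math.exp(x - mx) for x in row))
            total += log_z - row[i]
        return total / len(m)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(transposed))

def non_contrastive(img, txt):
    # Assumed stand-in for a non-contrastive term: mean (1 - cosine) over matched pairs, no negatives.
    pairs = zip(img, txt)
    return sum(1.0 - dot(l2_normalize(a), l2_normalize(b)) for a, b in pairs) / len(img)

def combined_loss(img, txt, weight=1.0, temperature=0.07):
    # Hypothetical weighting of the two terms; the paper's actual combination is not given here.
    return info_nce(img, txt, temperature) + weight * non_contrastive(img, txt)

# A batch containing two identical image-text pairs (a toy case of redundant data).
duplicates = [[1.0, 0.0], [1.0, 0.0]]
contrastive_term = info_nce(duplicates, duplicates)
alignment_term = non_contrastive(duplicates, duplicates)
```

In this toy batch the contrastive term cannot drop below log 2, because each sample's identical twin is treated as a negative, while the negatives-free term is zero. That is one way to picture why a non-contrastive signal might help on redundant data, under the assumptions stated above.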
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- CLIPin is genuinely plug-and-play, requiring no changes to base encoders, and demonstrates consistent improvements when added to multiple frameworks (ALBEF, BLIP, CoCa).
- The paper includes ablation studies, generalization tests, per-category breakdowns, and qualitative Grad-CAM visualizations, strengthening the empirical claims.
1. The paper conflates two fundamentally distinct data issues (noisy weak supervision in natural datasets and low textual diversity in medical reports) into a single failure mode of InfoNCE. However, these problems require different mitigation strategies (e.g., robust loss vs. diversity-aware sampling). No quantitative evidence (e.g., negative sample misclassification rate, alignment entropy) is provided to justify this unified framing.
2. The use of a non-public medical dataset ([Private Dataset]
The paper is well-grounded. It clearly identifies a practical, well-known flaw in InfoNCE, its vulnerability to noisy and redundant data, as its primary motivation. The shared pre-projector is a clever fix for a known conflict between contrastive and non-contrastive projector designs. This design makes the "plug-in" claim credible and is a nice engineering contribution. The ablations in Table 3 and Table 7 effectively show that the non-contrastive component is unstable on its own (prone to col
The paper’s main comparison is to xCLIP. This is too narrow. It ignores other, very similar methods like Cosmos (Kim et al., 2025), which also uses cross-modality self-distillation. The novelty of this work is questionable without a more thorough discussion of these closely related non-contrastive multimodal frameworks. The evaluation is almost entirely focused on classification (linear probe and ZSC). This is a major omission. A primary and arguably the most important use case for CLIP is cros
1. This paper proposes a unified plug-in that can seamlessly integrate a modular non-contrastive strategy into existing contrastive frameworks like CLIP.
2. Experiments on downstream tasks such as linear probing and prompt-based out-of-distribution zero-shot classification demonstrate that the proposed method can facilitate general and robust representation learning.
1. Regarding Lines 40-41, the authors attribute CLIP's issues on medical datasets to "semantically similar samples being treated as negative sample pairs." However, this phenomenon appears fundamentally similar to the many-to-many correspondence problem in natural datasets, where a single image/caption can be relevant to multiple batch samples. The introduction of distinct terms—"semantic looseness" for natural and "semantic redundancy" for medical datasets—for what seems to be a conceptually si
Taxonomy
Topics: linguistics and terminology studies · Natural Language Processing Techniques
