CLIMP: Contrastive Language-Image Mamba Pretraining

Nimrod Shabtay; Itamar Zimerman; Eli Schwartz; Raja Giryes

arXiv:2601.06891 · cs.CV · January 13, 2026

TL;DR

CLIMP introduces a Mamba-based contrastive vision-language model that improves robustness, efficiency, and flexibility over traditional Transformer-based CLIP, enabling better cross-modal retrieval and out-of-distribution performance.

Contribution

This work presents the first fully Mamba-based contrastive vision-language model, replacing both encoders to enhance efficiency, robustness, and input resolution flexibility.

Findings

01 Surpasses OpenAI's CLIP-ViT-B by 7.5% on ImageNet-O

02 Achieves up to 6.6% higher retrieval accuracy at 16x the training resolution

03 Uses 5x less memory and 1.8x fewer FLOPs

Abstract

Contrastive Language-Image Pre-training (CLIP) relies on Vision Transformers, whose attention mechanism is susceptible to spurious correlations and scales quadratically with resolution. To address these limitations, we present CLIMP, the first fully Mamba-based contrastive vision-language model, which replaces both the vision and text encoders with Mamba. The new architecture encodes sequential structure in both vision and language, with VMamba capturing visual spatial inductive biases, reducing reliance on spurious correlations and producing an embedding space favorable for cross-modal retrieval and out-of-distribution robustness, surpassing OpenAI's CLIP-ViT-B by 7.5% on ImageNet-O. CLIMP naturally supports variable input resolutions without positional encoding interpolation or specialized training, achieving up to 6.6% higher retrieval accuracy at 16x training resolution while using 5x…
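
For readers unfamiliar with the contrastive objective behind CLIP-style pretraining, the following is a minimal sketch of the standard symmetric image-text contrastive loss. It is an illustration only: the abstract does not spell out CLIMP's exact loss, so the function names, the fixed temperature of 0.07, and the use of plain Python lists in place of encoder outputs are all assumptions made here.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    # Hypothetical helper: image_embs[k] and text_embs[k] are assumed to be
    # a matched pair; every other combination in the batch is a negative.
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Pairwise cosine similarities, sharpened by the temperature (assumed value).
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text direction: each row should peak on its diagonal entry.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation over the columns.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)
```

In a model such as the one described above, the two embedding lists would come from the vision and text encoders; the loss itself does not depend on whether those encoders are Transformers or Mamba blocks.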

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning