CLIMP: Contrastive Language-Image Mamba Pretraining
Nimrod Shabtay, Itamar Zimerman, Eli Schwartz, Raja Giryes

TL;DR
CLIMP introduces a Mamba-based contrastive vision-language model that improves robustness, efficiency, and flexibility over traditional Transformer-based CLIP, enabling better cross-modal retrieval and out-of-distribution performance.
Contribution
This work presents the first fully Mamba-based contrastive vision-language model, replacing both encoders to enhance efficiency, robustness, and input resolution flexibility.
Findings
Surpasses CLIP-ViT-B by 7.5% on ImageNet-O
Achieves up to 6.6% higher retrieval accuracy at 16x the training resolution
Uses 5x less memory and 1.8x fewer FLOPs
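To give a rough sense of why resolution matters for cost, the sketch below counts patch tokens for a square image and compares how a quadratic pairwise-attention term and a linear-time scan grow with token count. It is not the paper's FLOP or memory accounting and does not reproduce the figures above. The 224-pixel base resolution, the 16-pixel patch size, and the function names are illustrative assumptions.

```python
def num_tokens(side_px: int, patch_px: int = 16) -> int:
    """Patch tokens for a square image; assumes side_px is divisible by patch_px."""
    per_side = side_px // patch_px
    return per_side * per_side


def relative_cost(side_px: int, base_px: int = 224, patch_px: int = 16):
    """Return (quadratic_ratio, linear_ratio) relative to the base resolution.

    quadratic_ratio models a term that grows with the square of the token
    count (pairwise attention); linear_ratio models a term that grows
    linearly with the token count (a sequential scan).
    """
    n = num_tokens(side_px, patch_px)
    n0 = num_tokens(base_px, patch_px)  # assumed base resolution, not from the paper
    ratio = n / n0
    return ratio ** 2, ratio


if __name__ == "__main__":
    for side in (224, 448, 896):
        quad, lin = relative_cost(side)
        print(f"side={side} tokens={num_tokens(side)} quadratic={quad:.0f}x linear={lin:.0f}x")
```

Under these assumptions, doubling the image side quadruples the token count, so the quadratic term grows 16-fold while the linear term grows 4-fold.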
Abstract
Contrastive Language-Image Pre-training (CLIP) relies on Vision Transformers whose attention mechanism is susceptible to spurious correlations and scales quadratically with resolution. To address these limitations, we present CLIMP, the first fully Mamba-based contrastive vision-language model that replaces both the vision and text encoders with Mamba. The new architecture encodes sequential structure in both vision and language, with VMamba capturing visual spatial inductive biases, reducing reliance on spurious correlations and producing an embedding space favorable for cross-modal retrieval and out-of-distribution robustness, surpassing OpenAI's CLIP-ViT-B by 7.5% on ImageNet-O. CLIMP naturally supports variable input resolutions without positional encoding interpolation or specialized training, achieving up to 6.6% higher retrieval accuracy at 16x training resolution while using 5x…
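The abstract does not spell out the training objective, so the following is only a minimal sketch of the symmetric contrastive loss commonly associated with CLIP-style pretraining, written in pure Python for readability. The function names, the toy embeddings, and the fixed temperature of 0.07 are illustrative assumptions and are not taken from the paper; the encoders themselves (Mamba or otherwise) are outside the scope of this sketch.

```python
import math


def _normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_emb[i] and text_emb[i] are assumed to describe the same pair.
    The temperature is a fixed, assumed value here.
    """
    imgs = [_normalize(v) for v in image_emb]
    txts = [_normalize(v) for v in text_emb]
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


if __name__ == "__main__":
    aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < swapped)
```

The loss is low when each image embedding is closest to its own caption and high when the pairs are mismatched, which is the property that makes the resulting embedding space usable for cross-modal retrieval.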
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
