CLIP-Mamba: CLIP Pretrained Mamba Models with OOD and Hessian Evaluation

Weiquan Huang; Yifei Shen; Yifan Yang

arXiv:2404.19394 · cs.CV · May 1, 2024 · 1 citation

TL;DR

This paper introduces CLIP-pretrained Mamba models, shows that they are competitive in zero-shot classification and robust to out-of-distribution (OOD) data, and uses Hessian analysis to examine their training landscape.

Contribution

First attempt to train transferable Mamba models with CLIP pretraining and to evaluate them on 26 zero-shot classification datasets and 16 OOD datasets, showing both their parameter efficiency and their training difficulty.

Findings

01 A Mamba model with 67M parameters is on par with a 307M-parameter ViT in zero-shot classification.

02 Mamba models perform especially well on OOD images with altered contrast or high-pass filtering.

03 Hessian analysis shows that Mamba models have sharper, more non-convex loss landscapes than ViT models.
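The third finding rests on curvature: a "sharper" landscape corresponds to a larger top Hessian eigenvalue, and a "more non-convex" one to negative eigenvalues. The source does not describe how the Hessian was analyzed, so the following is only a minimal sketch of one common approach (power iteration on Hessian-vector products), run on a toy quadratic loss. The function names, the finite-difference Hessian-vector product, and the toy matrix `A` are all assumptions made for illustration, not the authors' code.

```python
import numpy as np

def hessian_vector_product(grad_fn, w, v, eps=1e-4):
    """Approximate H @ v with a central finite difference of the gradient."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)

def top_hessian_eigenvalue(grad_fn, w, iters=200, seed=0):
    """Power iteration: returns the Hessian eigenvalue of largest magnitude."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    eig = 0.0
    for _ in range(iters):
        hv = hessian_vector_product(grad_fn, w, v)
        eig = float(v @ hv)          # Rayleigh quotient for the current direction
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break
        v = hv / norm
    return eig

def smallest_hessian_eigenvalue(grad_fn, w, lam_max, iters=200):
    """Power iteration on (lam_max * I - H); assumes lam_max is positive."""
    shifted_grad = lambda u: lam_max * u - grad_fn(u)
    return lam_max - top_hessian_eigenvalue(shifted_grad, w, iters)

# Toy (hypothetical) loss 0.5 * w^T A w, whose Hessian is exactly A.
# The negative entry makes the landscape non-convex along one direction.
A = np.diag([5.0, 1.0, -2.0])

def toy_grad(w):
    return A @ w

w0 = np.array([0.3, -0.1, 0.2])
lam_max = top_hessian_eigenvalue(toy_grad, w0)               # sharpness proxy
lam_min = smallest_hessian_eigenvalue(toy_grad, w0, lam_max)  # negative => non-convex
print(f"largest eigenvalue:  {lam_max:.4f}")
print(f"smallest eigenvalue: {lam_min:.4f}")
```

For a real network the gradient function would come from automatic differentiation rather than a closed form, and the finite-difference product would usually be replaced by an exact Hessian-vector product.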

Abstract

State space models and Mamba-based models have been increasingly applied across various domains, achieving state-of-the-art performance. This technical report introduces the first attempt to train a transferable Mamba model utilizing contrastive language-image pretraining (CLIP). We have trained Mamba models of varying sizes and undertaken comprehensive evaluations of these models on 26 zero-shot classification datasets and 16 out-of-distribution (OOD) datasets. Our findings reveal that a Mamba model with 67 million parameters is on par with a 307 million-parameter Vision Transformer (ViT) model in zero-shot classification tasks, highlighting the parameter efficiency of Mamba models. In tests of OOD generalization, Mamba-based models exhibit exceptional performance in conditions of OOD image contrast or when subjected to high-pass filtering. However, a Hessian analysis indicates that…
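To make the two central ideas of the abstract concrete, here is a minimal NumPy sketch of a CLIP-style symmetric contrastive loss and of zero-shot classification by cosine similarity. It makes several assumptions: the embeddings are random stand-ins for encoder outputs, the temperature of 0.07 is a commonly used CLIP initial value rather than a setting confirmed by this report, and the function names are hypothetical. Nothing in it is specific to Mamba; the image encoder is simply whatever produces `image_emb`.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a matched pair.
    """
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

def zero_shot_predict(image_emb, class_text_emb):
    """Assign each image to the class whose text embedding is most similar."""
    sims = l2_normalize(image_emb) @ l2_normalize(class_text_emb).T
    return sims.argmax(axis=1)

rng = np.random.default_rng(0)

# Matched pairs should give a lower loss than deliberately mismatched pairs.
img = rng.standard_normal((4, 8))
txt_matched = img + 0.01 * rng.standard_normal((4, 8))
txt_shuffled = txt_matched[[1, 2, 3, 0]]
loss_matched = clip_contrastive_loss(img, txt_matched)
loss_shuffled = clip_contrastive_loss(img, txt_shuffled)
print(f"matched loss:  {loss_matched:.4f}")
print(f"shuffled loss: {loss_shuffled:.4f}")

# Zero-shot classification: three hypothetical class prompts, three test images.
class_emb = np.eye(3, 8)
test_img = class_emb[[2, 0, 1]] + 0.05 * rng.standard_normal((3, 8))
preds = zero_shot_predict(test_img, class_emb)
print("predicted classes:", preds.tolist())
```

Zero-shot evaluation needs no training on the target dataset: only the class names, turned into text embeddings, are required, which is why a single pretrained model can be scored on many datasets.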

Code & Models

Repositories

raytrun/mamba-clip (PyTorch, official)

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis

Methods: Attention Is All You Need · Dropout · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Vision Transformer · Linear Layer · Dense Connections