Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders
Shang-Jui Ray Kuo, Paola Cascante-Bonilla

TL;DR
This paper evaluates state space model (SSM) vision backbones as alternatives to transformer-based encoders in vision-language models, demonstrating competitive performance and robustness, especially at smaller scales.
Contribution
The paper systematically compares SSM and transformer backbones in VLMs under a controlled setting, showing that SSMs are a viable, efficient alternative with improved robustness and competitive results.
Findings
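Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across VQA and grounding/localization.
Dense-task tuning (detection or segmentation training) generally improves performance across backbone families.
Higher ImageNet accuracy does not guarantee better VLM results.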
Abstract
Large vision-language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We further observe that (i) higher ImageNet accuracy or larger…
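The pipeline described in the abstract (a frozen vision backbone whose features pass through a lightweight connector into the language model) can be illustrated with a toy sketch. This is not the authors' implementation: the single-matrix "backbone", the linear connector, and every name and dimension below (`frozen_backbone`, `connector`, `VISION_DIM`, `LLM_DIM`, and so on) are hypothetical stand-ins chosen only to show how the shapes flow.

```python
import random


def make_matrix(rows, cols, rng):
    """Random (rows x cols) matrix as nested lists; placeholder for real weights or data."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Multiply a (n x k) by b (k x m) using plain Python lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def frozen_backbone(image_patches, weights):
    """Hypothetical stand-in for a frozen vision encoder (ViT or SSM).

    Maps each patch to a feature vector. The weights are never updated here,
    which is what "frozen" means in this sketch.
    """
    return matmul(image_patches, weights)


def connector(features, projection):
    """Assumed lightweight connector: one linear map into the LLM embedding size."""
    return matmul(features, projection)


def build_llm_input(visual_tokens, text_embeddings):
    """Prepend the projected visual tokens to the text token embeddings."""
    return visual_tokens + text_embeddings


rng = random.Random(0)
NUM_PATCHES, PATCH_DIM, VISION_DIM, LLM_DIM, NUM_TEXT = 4, 6, 8, 16, 3

patches = make_matrix(NUM_PATCHES, PATCH_DIM, rng)    # toy image patches
backbone_w = make_matrix(PATCH_DIM, VISION_DIM, rng)  # frozen encoder weights
proj_w = make_matrix(VISION_DIM, LLM_DIM, rng)        # trainable connector weights
text = make_matrix(NUM_TEXT, LLM_DIM, rng)            # toy text token embeddings

features = frozen_backbone(patches, backbone_w)       # (4 x 8)
visual_tokens = connector(features, proj_w)           # (4 x 16)
llm_input = build_llm_input(visual_tokens, text)      # (7 x 16)

print(len(llm_input), len(llm_input[0]))  # prints: 7 16
```

In this framing, swapping a ViT for an SSM backbone only changes what `frozen_backbone` computes; the connector and language model see the same kind of token sequence either way, which is what makes a controlled comparison of backbones possible.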
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning
