Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

Shang-Jui Ray Kuo; Paola Cascante-Bonilla

arXiv:2603.19209·cs.CV·March 20, 2026

Open Access

TL;DR

This paper evaluates state space model (SSM) vision backbones as alternatives to transformer-based encoders in vision-language models, demonstrating competitive performance and robustness, especially at smaller scales.

Contribution

The paper systematically compares SSM and transformer vision backbones in VLMs under controlled conditions, showing that SSMs are a viable, efficient alternative with improved robustness and competitive results.

Findings

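01 Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization.

02 Dense-task tuning (detection or segmentation training) generally improves performance for both SSM and ViT-family backbones.

03 Higher ImageNet accuracy does not guarantee better VLM results.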

Abstract

Large vision-language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We further observe that (i) higher ImageNet accuracy or larger…
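The pipeline named in the abstract's opening sentence (frozen vision backbone, lightweight connector, language model) can be illustrated with a minimal, dependency-free sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the toy dimensions, the random stand-in for the backbone, and the choice of a two-layer MLP with ReLU as the connector. The point is only that the backbone's weights stay fixed, so swapping a ViT for an SSM changes what `frozen_backbone` computes while the connector interface stays the same.

```python
import random

def make_linear(d_in, d_out, rng):
    """Return a d_in x d_out weight matrix with small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_out)] for _ in range(d_in)]

def linear(x, w):
    """Apply y = x @ w to a single feature vector x (a list of floats)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def frozen_backbone(image, num_tokens, d_vision):
    """Hypothetical stand-in for a frozen vision encoder (ViT or SSM).

    Maps a flat list of pixel values to num_tokens feature vectors.
    The fixed seed means the weights are identical on every call,
    which is how "frozen" is modelled in this toy example.
    """
    rng = random.Random(0)
    w = make_linear(len(image), num_tokens * d_vision, rng)
    flat = linear(image, w)
    return [flat[t * d_vision:(t + 1) * d_vision] for t in range(num_tokens)]

def connector(tokens, w1, w2):
    """Assumed connector: a two-layer MLP with ReLU, applied per token.

    Projects each vision feature vector into the LLM embedding width.
    In a real system these would be the trainable weights.
    """
    projected = []
    for tok in tokens:
        hidden = [max(0.0, h) for h in linear(tok, w1)]
        projected.append(linear(hidden, w2))
    return projected

# Toy dimensions, chosen only to keep the example small.
D_VISION, D_LLM, NUM_TOKENS = 8, 16, 4

rng = random.Random(1)
w1 = make_linear(D_VISION, D_LLM, rng)
w2 = make_linear(D_LLM, D_LLM, rng)
image = [rng.random() for _ in range(12)]

features = frozen_backbone(image, NUM_TOKENS, D_VISION)
visual_tokens = connector(features, w1, w2)
print(len(visual_tokens), len(visual_tokens[0]))  # prints "4 16"
```

The resulting `visual_tokens` would be placed alongside the text token embeddings at the language model's input; that step is omitted here because the abstract does not describe it.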

Code & Models

Models

raykuo188/vlm-ssm-vision-encoders-checkpoints (Hugging Face model)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning