Viper-F1: Fast and Fine-Grained Multimodal Understanding with Cross-Modal State-Space Modulation

Quoc-Huy Trinh

arXiv:2511.11177·cs.CV·November 26, 2025

Open Access

TL;DR

Viper-F1 is an efficient multimodal model that replaces Transformer cross-attention with state-space dynamics and adds a correlation module, enabling fine-grained vision-language understanding at lower computational cost.

Contribution

The paper proposes Viper-F1, a hybrid state-space vision-language model with a Token-Grid Correlation Module, aiming for efficient and precise multimodal understanding without the quadratic cost of attention-based methods.

Findings

01 Outperforms existing models on multiple benchmarks.

02 Achieves linear-time inference with high accuracy.

03 Effectively captures fine-grained visual details.

Abstract

Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as robotic manipulation, personal assistants, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, leading to degraded performance on fine-grained reasoning tasks that limit their effectiveness in the real world. To address these issues, we introduce Viper-F1, a Hybrid State-Space Vision-Language Model that replaces attention with efficient Liquid State-Space Dynamics. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight…
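The efficiency argument in the abstract can be illustrated with a toy comparison. The sketch below is a generic scalar linear state-space recurrence, not the paper's Liquid State-Space Dynamics, whose details are not given here. The function names and the parameters `a`, `b`, and `c` are hypothetical and chosen only for illustration. It shows why a recurrent scan touches each token once, while dense cross-attention scores every query against every key.

```python
# Toy illustration only: a generic scalar state-space recurrence,
#   h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t
# This is an assumed stand-in, NOT the Liquid State-Space Dynamics of Viper-F1.


def ssm_scan(xs, a=0.5, b=1.0, c=1.0):
    """Run the recurrence over xs in a single pass and return the outputs."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # one state update per token
        ys.append(c * h)
    return ys


def scan_step_count(seq_len):
    """Number of state updates the scan performs: linear in sequence length."""
    return seq_len


def pairwise_score_count(num_queries, num_keys):
    """Number of query-key scores dense cross-attention computes."""
    return num_queries * num_keys


if __name__ == "__main__":
    print(ssm_scan([1.0, 0.0, 0.0]))
    for n in (64, 128, 256):
        print(n, scan_step_count(n), pairwise_score_count(n, n))
```

Doubling the sequence length doubles the scan's step count but quadruples the number of attention scores when queries and keys grow together, which is the gap the abstract refers to.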

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling