MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking
Xinqi Liu, Li Zhou, Zikun Zhou, Jianqiu Chen, and Zhenyu He

TL;DR
MambaVLT introduces a novel time-evolving multimodal state space model for vision-language tracking, effectively capturing temporal information and dynamically updating reference features to improve tracking robustness.
Contribution
The paper presents a new Mamba-based model that leverages state space evolution for multimodal tracking, integrating a hybrid state space block, locality enhancement, and modality selection for improved performance.
Findings
Outperforms state-of-the-art trackers on multiple benchmarks.
Effectively models temporal information with linear complexity.
Dynamically balances visual and language references to reduce ambiguity.
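To illustrate the linear-complexity claim, here is a minimal sketch of a generic discrete linear state space recurrence in plain Python. It is not the MambaVLT architecture and it omits Mamba's input-dependent (selective) parameters. The function name `ssm_scan` and the fixed diagonal parameters `a`, `b`, `c` are assumptions made for this example only. The point is that a fixed-size hidden state is updated once per time step, so the cost grows linearly with sequence length while the state carries information forward from earlier inputs.

```python
from typing import List


def ssm_scan(xs: List[float], a: List[float], b: List[float], c: List[float]) -> List[float]:
    """Run a diagonal linear state space recurrence over a scalar sequence.

    h_t = a * h_{t-1} + b * x_t   (elementwise over the state dimension)
    y_t = sum(c * h_t)

    Hypothetical illustration: a, b, c are fixed here, whereas Mamba
    computes its parameters from the input at each step.
    """
    h = [0.0] * len(a)  # fixed-size state, independent of sequence length
    ys = []
    for x in xs:  # one constant-cost update per time step -> linear in len(xs)
        h = [ai * hi + bi * x for ai, hi, bi in zip(a, h, b)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys


# An impulse at t=0 decays through the state at later steps.
ys = ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0])
print(ys)
```

The output at later steps is nonzero even though the later inputs are zero, which is the sense in which the evolving state memorizes earlier information.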
Abstract
The vision-language tracking task aims to perform object tracking based on various modality references. Existing Transformer-based vision-language tracking methods have made remarkable progress by leveraging the global modeling ability of self-attention. However, current approaches still face challenges in effectively exploiting the temporal information and dynamically updating reference features during tracking. Recently, the State Space Model (SSM), known as Mamba, has shown astonishing ability in efficient long-sequence modeling. Particularly, its state space evolving process demonstrates promising capabilities in memorizing multimodal temporal information with linear complexity. Witnessing its success, we propose a Mamba-based vision-language tracking model to exploit its state space evolving ability in temporal space for robust multimodal tracking, dubbed MambaVLT. In particular,…
Taxonomy
Topics: Speech and dialogue systems · Gaze Tracking and Assistive Technology
Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces
