MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking

Xinqi Liu, Li Zhou, Zikun Zhou, Jianqiu Chen, and Zhenyu He

arXiv:2411.15459·cs.CV·November 26, 2024

Open Access

TL;DR

MambaVLT introduces a novel time-evolving multimodal state space model for vision-language tracking, effectively capturing temporal information and dynamically updating reference features to improve tracking robustness.

Contribution

The paper presents a new Mamba-based model that leverages state space evolution for multimodal tracking, integrating a hybrid state space block, locality enhancement, and modality selection for improved performance.

Findings

01. Outperforms state-of-the-art trackers on multiple benchmarks.

02. Effectively models temporal information with linear complexity.

03. Dynamically balances visual and language references to reduce ambiguity.

Abstract

The vision-language tracking task aims to perform object tracking based on various modality references. Existing Transformer-based vision-language tracking methods have made remarkable progress by leveraging the global modeling ability of self-attention. However, current approaches still face challenges in effectively exploiting the temporal information and dynamically updating reference features during tracking. Recently, the State Space Model (SSM), known as Mamba, has shown astonishing ability in efficient long-sequence modeling. Particularly, its state space evolving process demonstrates promising capabilities in memorizing multimodal temporal information with linear complexity. Witnessing its success, we propose a Mamba-based vision-language tracking model to exploit its state space evolving ability in temporal space for robust multimodal tracking, dubbed MambaVLT. In particular,…
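The linear-complexity claim in the abstract can be illustrated with a generic state space recurrence. The following is a minimal sketch, not MambaVLT's architecture: it assumes a diagonal state matrix and fixed parameters, whereas Mamba's selective SSM makes its parameters depend on the input. The function name `ssm_scan` and the parameters `a`, `b`, `c` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def ssm_scan(
    xs: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
) -> List[float]:
    """Run a diagonal linear state space recurrence over a scalar sequence.

    Illustrative assumption (not taken from the paper):
        h_t[i] = a[i] * h_{t-1}[i] + b[i] * x_t
        y_t    = sum_i c[i] * h_t[i]
    """
    # The hidden state has a fixed size, independent of sequence length.
    h = [0.0] * len(a)
    ys: List[float] = []
    # One pass over the sequence: cost grows linearly with len(xs).
    for x in xs:
        # The state evolves: old information decays by a[i], new input enters via b[i].
        h = [ai * hi + bi * x for ai, hi, bi in zip(a, h, b)]
        # Read out the current state as the output for this time step.
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys
```

For example, `ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0])` returns `[1.0, 0.5, 0.25]`: the single input is remembered in the state and decays over later steps. Because each step touches only the fixed-size state, the whole scan costs time proportional to the sequence length, in contrast to self-attention, whose cost grows quadratically with it.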

Taxonomy

Topics: Speech and dialogue systems · Gaze Tracking and Assistive Technology

Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces