MambaVT: Spatio-Temporal Contextual Modeling for robust RGB-T Tracking

Simiao Lai; Chang Liu; Jiawen Zhu; Ben Kang; Yang Liu; Dong Wang; Huchuan Lu

arXiv:2408.07889·cs.CV·August 16, 2024

TL;DR

MambaVT introduces a spatio-temporal modeling framework for RGB-T tracking built on the Mamba State Space Model. It achieves state-of-the-art results at lower computational cost by capturing both long- and short-term temporal information.

Contribution

This work presents the first pure Mamba-based RGB-T tracking framework, leveraging its long sequence modeling and linear complexity to improve robustness and efficiency.

Findings

01 Achieves state-of-the-art performance on four benchmarks.

02 Requires lower computational cost than existing methods.

03 Demonstrates effective exploitation of spatio-temporal context in RGB-T tracking.

Abstract

Existing RGB-T tracking algorithms have made remarkable progress by leveraging the global interaction capability and extensive pre-trained models of the Transformer architecture. Nonetheless, these methods mainly adopt image-pair appearance matching and face challenges of the intrinsic high quadratic complexity of the attention mechanism, resulting in constrained exploitation of temporal information. Inspired by the recently emerged State Space Model Mamba, renowned for its impressive long sequence modeling capabilities and linear computational complexity, this work innovatively proposes a pure Mamba-based framework (MambaVT) to fully exploit spatio-temporal contextual modeling for robust visible-thermal tracking. Specifically, we devise the long-range cross-frame integration component to globally adapt to target appearance variations, and introduce short-term historical trajectory…
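
The contrast the abstract draws between quadratic attention and linear state-space modeling can be illustrated with a minimal sketch. This is not MambaVT's code and not Mamba's selective scan: it assumes a plain diagonal state-space recurrence with fixed parameters (Mamba makes its parameters input-dependent and uses a hardware-aware parallel scan). The function names `ssm_scan`, `scan_steps`, and `attention_pairs` are hypothetical and exist only for this illustration.

```python
from typing import List, Sequence


def ssm_scan(a: Sequence[float], b: Sequence[float],
             c: Sequence[float], xs: Sequence[float]) -> List[float]:
    """Run a diagonal linear state-space recurrence over a 1-D sequence.

    h_t = a * h_{t-1} + b * x_t   (element-wise over state dimensions)
    y_t = sum(c * h_t)

    Assumption: a, b, c are fixed per-state parameters. Mamba instead
    derives its parameters from the input at every step.
    """
    n = len(a)
    h = [0.0] * n          # hidden state carries the whole history
    ys = []
    for x in xs:           # one pass: cost grows linearly with len(xs)
        h = [a[i] * h[i] + b[i] * x for i in range(n)]
        ys.append(sum(c[i] * h[i] for i in range(n)))
    return ys


def scan_steps(seq_len: int) -> int:
    """Number of state updates a recurrent scan performs."""
    return seq_len


def attention_pairs(seq_len: int) -> int:
    """Number of token pairs full self-attention scores."""
    return seq_len * seq_len


if __name__ == "__main__":
    print(ssm_scan([0.5], [1.0], [1.0], [1.0, 1.0, 1.0]))
    for length in (256, 1024, 4096):
        print(length, scan_steps(length), attention_pairs(length))
```

Because the hidden state summarizes all earlier tokens, appending more frames to the input sequence adds a proportional number of updates, whereas the pairwise score count of full attention grows with the square of the sequence length.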

Code & Models

Repositories

laisimiao/MambaVT
pytorch

Taxonomy

Topics: Video Surveillance and Tracking Methods · Advanced Vision and Imaging · Autonomous Vehicle Technology and Safety

Methods: Linear Layer · Layer Normalization · Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections