UBATrack: Spatio-Temporal State Space Model for General Multi-Modal Tracking

Qihua Liang; Liang Chen; Yaozong Zheng; Jian Nong; Zhiyi Mo; Bineng Zhong

arXiv:2601.14799 · cs.CV · January 22, 2026

Open Access

TL;DR

UBATrack is a multi-modal tracking framework built on a Mamba-style spatio-temporal state space model. It captures cross-modal dependencies and spatio-temporal cues through adapter tuning rather than full fine-tuning, which is reported to improve robustness and to outperform existing methods on multiple benchmarks.

Contribution

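The paper proposes UBATrack, a multi-modal tracking framework built on a Mamba-style state space model. It adds two modules: a Spatio-temporal Mamba Adapter (STMA) for joint cross-modal and spatio-temporal modeling, and a Dynamic Multi-modal Feature Mixer for feature mixing, aimed at better efficiency and performance.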

Findings

01 Outperforms state-of-the-art on multiple benchmarks

02 Effectively models cross-modal dependencies

03 Improves training efficiency without full fine-tuning

Abstract

Multi-modal object tracking has attracted considerable attention by integrating multiple complementary inputs (e.g., thermal, depth, and event data) to achieve outstanding performance. Although current general-purpose multi-modal trackers primarily unify various modal tracking tasks (i.e., RGB-Thermal infrared, RGB-Depth, or RGB-Event tracking) through prompt learning, they still overlook the effective capture of spatio-temporal cues. In this work, we introduce a novel multi-modal tracking framework based on a Mamba-style state space model, termed UBATrack. UBATrack comprises two simple yet effective modules: a Spatio-temporal Mamba Adapter (STMA) and a Dynamic Multi-modal Feature Mixer. The former leverages Mamba's long-sequence modeling capability to jointly model cross-modal dependencies and spatio-temporal visual cues in an adapter-tuning manner. The latter further enhances…
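
To make the adapter-tuning idea in the abstract concrete, here is a minimal sketch of a residual adapter that runs a state space recurrence over a single sequence of multi-modal, multi-frame tokens. It is not the authors' STMA: it assumes a plain diagonal linear recurrence instead of Mamba's input-dependent (selective) parameters, and every name in it (`ssm_scan`, `ssm_adapter`, `w_down`, `w_up`, the token layout, the sizes) is hypothetical and chosen only for illustration.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Diagonal linear state space recurrence over a token sequence.

    x: (T, R) tokens; a, b, c: (R,) per-channel parameters (illustrative).
    h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t
    """
    h = np.zeros(x.shape[1])
    y = np.empty(x.shape, dtype=float)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]   # state carries information from earlier tokens
        y[t] = c * h
    return y

def ssm_adapter(tokens, w_down, w_up, a, b, c):
    """Residual bottleneck adapter: down-project, scan, up-project, add."""
    z = tokens @ w_down          # (T, D) -> (T, R), small trainable bottleneck
    z = ssm_scan(z, a, b, c)     # mix information along the token sequence
    return tokens + z @ w_up     # residual keeps the frozen backbone features

rng = np.random.default_rng(0)
frames, n_tok, dim, rank = 2, 4, 8, 3
rgb = rng.normal(size=(frames, n_tok, dim))   # RGB tokens per frame
aux = rng.normal(size=(frames, n_tok, dim))   # thermal/depth/event tokens per frame

# Assumed layout: within each frame, RGB tokens then auxiliary tokens;
# frames follow each other in time order, giving one long sequence.
seq = np.concatenate([rgb, aux], axis=1).reshape(-1, dim)

w_down = rng.normal(size=(dim, rank)) * 0.1
w_up = np.zeros((rank, dim))   # zero init: the adapter starts as an identity map
a, b, c = np.full(rank, 0.9), np.ones(rank), np.ones(rank)

out = ssm_adapter(seq, w_down, w_up, a, b, c)
```

Because both modalities and all frames sit in one sequence, the recurrent state lets later tokens depend on earlier ones across modalities and time, while only the small adapter parameters would need training. The zero-initialised `w_up` is a common adapter convention, assumed here rather than taken from the paper.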

Taxonomy

Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Gaze Tracking and Assistive Technology