Uni-MDTrack: Learning Decoupled Memory and Dynamic States for Parameter-Efficient Visual Tracking in All Modality

Wenrui Cai; Zhenyi Lu; Yuzhe Li; Yongchao Feng; Jinqing Zhang; Qingjie Liu; Yunhong Wang

arXiv:2603.14452·cs.CV·March 17, 2026

Open Access

TL;DR

Uni-MDTrack introduces a novel, parameter-efficient approach for multi-modal visual tracking that leverages memory compression and dynamic state fusion, achieving state-of-the-art results across diverse datasets with reduced computational costs.

Contribution

The paper proposes Uni-MDTrack, a new framework with Memory-Aware Compression Prompt and Dynamic State Fusion modules, enabling efficient, unified multi-modal tracking with improved performance and generality.

Findings

01. Achieves state-of-the-art results on 10 datasets across five modalities.

02. Only 30% of parameters need training for substantial performance gains.

03. MCP and DSF modules are effective plug-and-play components.
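To make the "30% of parameters" figure concrete, the sketch below shows how a trainable-parameter fraction is typically computed. The module names and parameter counts are hypothetical placeholders chosen only to land on 30%; this page does not state which parts of Uni-MDTrack are frozen or how large they are.

```python
def trainable_fraction(modules: dict[str, tuple[int, bool]]) -> float:
    """Return the share of parameters that receive gradient updates.

    `modules` maps a module name to (parameter_count, is_trainable).
    """
    total = sum(count for count, _ in modules.values())
    if total == 0:
        raise ValueError("model has no parameters")
    trainable = sum(count for count, is_trainable in modules.values() if is_trainable)
    return trainable / total


# Hypothetical split, NOT taken from the paper: a frozen backbone plus
# trainable added modules, sized so the trainable share is exactly 30%.
example_model = {
    "backbone": (70_000_000, False),
    "added_modules": (30_000_000, True),
}

fraction = trainable_fraction(example_model)
print(f"trainable share: {fraction:.0%}")
```

In a deep-learning framework the same ratio would be obtained by summing parameter sizes over the tensors that are flagged as requiring gradients.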

Abstract

With the advent of Transformer-based one-stream trackers that possess strong capability in inter-frame relation modeling, recent research has increasingly focused on how to introduce spatio-temporal context. However, most existing methods rely on a limited number of historical frames, which not only leads to insufficient utilization of the context, but also inevitably increases the input length and incurs prohibitive computational overhead. Methods that query an external memory bank, on the other hand, suffer from inadequate fusion between the retrieved spatio-temporal features and the backbone. Moreover, using discrete historical frames as context overlooks the rich dynamics of the target. To address these issues, we propose Uni-MDTrack, which consists of two core components: a Memory-Aware Compression Prompt (MCP) module and a Dynamic State Fusion (DSF) module. MCP effectively compresses…
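The abstract's point that adding historical frames lengthens the input and raises cost can be illustrated with a rough calculation. This is a minimal sketch under stated assumptions: a ViT-style backbone with 16-pixel patches, 128-pixel template frames, and a 256-pixel search region. These sizes are common in one-stream trackers but are not taken from this paper, and the function names are illustrative.

```python
def tokens_per_frame(image_size: int, patch_size: int) -> int:
    """Number of patch tokens produced for one square frame."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def attention_pairs(num_template_frames: int,
                    template_size: int = 128,   # assumed, not from the paper
                    search_size: int = 256,     # assumed, not from the paper
                    patch_size: int = 16) -> int:
    """Token pairs scored by one full self-attention layer over the
    concatenated template and search tokens (cost grows as N squared)."""
    n = (num_template_frames * tokens_per_frame(template_size, patch_size)
         + tokens_per_frame(search_size, patch_size))
    return n * n


for frames in (1, 4, 8):
    print(frames, attention_pairs(frames))
```

Under these assumptions, going from one historical frame to eight raises the token count from 320 to 768 and the per-layer attention pairs from 102,400 to 589,824, which is the kind of overhead that motivates compressing memory rather than appending frames.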

Taxonomy

Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Gaze Tracking and Assistive Technology