Single-Model and Any-Modality for Video Object Tracking
Zongwei Wu, Jilai Zheng, Xiangxuan Ren, Florin-Alexandru Vasluianu, Chao Ma, Danda Pani Paudel, Luc Van Gool, Radu Timofte

TL;DR
Un-Track is a unified transformer-based video object tracker capable of handling any modality, including missing ones, by learning a shared latent space from RGB-X pairs, achieving state-of-the-art results across multiple datasets.
Contribution
This work introduces Un-Track, the first single-model tracker that unifies multiple modalities using a shared latent space learned solely from RGB-X pairs, enabling effective multi-modality tracking.
Findings
Achieves a +8.1 F-score improvement on the DepthTrack dataset.
Surpasses state-of-the-art unified and modality-specific trackers on five benchmarks.
Adds minimal computational overhead: +2.14 GFLOPs and +6.6M parameters.
Abstract
In the realm of video object tracking, auxiliary modalities such as depth, thermal, or event data have emerged as valuable assets to complement the RGB trackers. In practice, most existing RGB trackers learn a single set of parameters to use them across datasets and applications. However, a similar single-model unification for multi-modality tracking presents several challenges. These challenges stem from the inherent heterogeneity of inputs -- each with modality-specific representations, the scarcity of multi-modal datasets, and the absence of all the modalities at all times. In this work, we introduce Un-Track, a Unified Tracker of a single set of parameters for any modality. To handle any modality, our method learns their common latent space through low-rank factorization and reconstruction techniques. More importantly, we use only the RGB-X pairs to learn the common latent space.…
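The abstract mentions low-rank factorization and reconstruction without giving details here. As a generic illustration of that idea only, and not Un-Track's actual implementation, the sketch below factorizes a small feature matrix into a few rank-1 components by power iteration and then reconstructs it from them. All function names and the toy matrix are hypothetical, chosen for this example.

```python
import math
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def top_singular_triplet(M, iters=200, seed=0):
    """Power iteration on M^T M; returns (sigma, u, v) for the dominant rank-1 term."""
    rng = random.Random(seed)
    Mt = transpose(M)
    v = [rng.uniform(-1, 1) for _ in range(len(M[0]))]
    for _ in range(iters):
        w = matvec(Mt, matvec(M, v))
        n = norm(w)
        if n == 0.0:  # M is (numerically) zero, nothing left to extract
            break
        v = [x / n for x in w]
    Mv = matvec(M, v)
    sigma = norm(Mv)
    u = [x / sigma for x in Mv] if sigma > 0 else [0.0] * len(M)
    return sigma, u, v


def low_rank_factorize(M, rank):
    """Extract `rank` rank-1 components by repeatedly deflating the residual."""
    residual = [row[:] for row in M]
    factors = []
    for _ in range(rank):
        sigma, u, v = top_singular_triplet(residual)
        factors.append((sigma, u, v))
        residual = [
            [residual[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))
        ]
    return factors


def reconstruct(factors, n_rows, n_cols):
    """Sum the rank-1 components back into a full matrix."""
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for sigma, u, v in factors:
        for i in range(n_rows):
            for j in range(n_cols):
                out[i][j] += sigma * u[i] * v[j]
    return out


def frobenius_error(A, B):
    return math.sqrt(sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb)))


# Toy "feature" matrix whose rows are combinations of two basis vectors,
# [1, 2, 3] and [0, 1, -1], so its true rank is 2.
features = [[1, 2, 3], [0, 1, -1], [2, 5, 5], [-1, 1, -6]]

err_rank1 = frobenius_error(features, reconstruct(low_rank_factorize(features, 1), 4, 3))
err_rank2 = frobenius_error(features, reconstruct(low_rank_factorize(features, 2), 4, 3))
print(err_rank1, err_rank2)
```

In this toy case, a single component leaves a clearly nonzero reconstruction error, while two components recover the matrix up to floating-point noise, because the data has rank 2 by construction. How Un-Track chooses the rank, and what it factorizes, is not stated in the text above.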
Taxonomy
Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Advanced Neural Network Applications
Methods: Sparse Evolutionary Training
