Single-Model and Any-Modality for Video Object Tracking

Zongwei Wu; Jilai Zheng; Xiangxuan Ren; Florin-Alexandru Vasluianu; Chao Ma; Danda Pani Paudel; Luc Van Gool; Radu Timofte

arXiv:2311.15851 · cs.CV · April 1, 2024 · 1 citation

TL;DR

Un-Track is a unified transformer-based video object tracker capable of handling any modality, including missing ones, by learning a shared latent space from RGB-X pairs, achieving state-of-the-art results across multiple datasets.

Contribution

This work introduces Un-Track, the first single-model tracker that unifies multiple modalities using a shared latent space learned solely from RGB-X pairs, enabling effective multi-modality tracking.

Findings

1. Achieves a +8.1 F-score improvement on the DepthTrack dataset.
2. Surpasses state-of-the-art unified and modality-specific trackers on five benchmarks.
3. Adds minimal computational overhead: +2.14 GFLOPs and +6.6M parameters.

Abstract

In the realm of video object tracking, auxiliary modalities such as depth, thermal, or event data have emerged as valuable assets to complement the RGB trackers. In practice, most existing RGB trackers learn a single set of parameters to use them across datasets and applications. However, a similar single-model unification for multi-modality tracking presents several challenges. These challenges stem from the inherent heterogeneity of inputs -- each with modality-specific representations, the scarcity of multi-modal datasets, and the absence of all the modalities at all times. In this work, we introduce Un-Track, a Unified Tracker of a single set of parameters for any modality. To handle any modality, our method learns their common latent space through low-rank factorization and reconstruction techniques. More importantly, we use only the RGB-X pairs to learn the common latent space.…
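The idea of a low-rank latent space with reconstruction, as mentioned in the abstract, can be illustrated with a small numerical sketch. This is not the Un-Track implementation: a truncated SVD is swapped in here as a simple stand-in for a learned low-rank factorization, the token features are synthetic, and the names `low_rank_project` and `reconstruct` are hypothetical.

```python
import numpy as np


def low_rank_project(tokens, rank):
    """Project token features onto their top-`rank` right singular vectors.

    tokens: array of shape (n_tokens, dim).
    Returns (latent, basis) with shapes (n_tokens, rank) and (rank, dim).
    Assumption: a truncated SVD stands in for a learned factorization.
    """
    _, _, vt = np.linalg.svd(tokens, full_matrices=False)
    basis = vt[:rank]            # orthonormal rows spanning the latent subspace
    latent = tokens @ basis.T    # low-rank coordinates of every token
    return latent, basis


def reconstruct(latent, basis):
    """Map latent coordinates back to the original feature dimension."""
    return latent @ basis


# Synthetic "auxiliary modality" tokens: rank-4 structure plus small noise.
rng = np.random.default_rng(0)
n_tokens, dim, true_rank = 64, 32, 4
clean = rng.normal(size=(n_tokens, true_rank)) @ rng.normal(size=(true_rank, dim))
x_tokens = clean + 0.01 * rng.normal(size=(n_tokens, dim))

latent, basis = low_rank_project(x_tokens, rank=4)
recon = reconstruct(latent, basis)

# Relative reconstruction error at the matching rank and at a too-small rank.
rel_err = np.linalg.norm(x_tokens - recon) / np.linalg.norm(x_tokens)
latent_1, basis_1 = low_rank_project(x_tokens, rank=1)
rel_err_rank1 = (
    np.linalg.norm(x_tokens - reconstruct(latent_1, basis_1))
    / np.linalg.norm(x_tokens)
)
```

In this toy setting the reconstruction error is small once the chosen rank matches the rank of the underlying structure, and it grows when the rank is too low. A trained tracker would learn the projection from data rather than compute an SVD per input.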

Code & Models

Repositories

zongwei97/untrack (PyTorch, official)

Taxonomy

Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Advanced Neural Network Applications

Methods: Sparse Evolutionary Training