XTrack: Multimodal Training Boosts RGB-X Video Object Trackers

Yuedong Tan; Zongwei Wu; Yuqian Fu; Zhuyun Zhou; Guolei Sun; Eduard Zamfir; Chao Ma; Danda Pani Paudel; Luc Van Gool; Radu Timofte

arXiv:2405.17773 · cs.CV · December 2, 2024

TL;DR

XTrack introduces a multimodal training approach for RGB-X video object tracking that leverages cross-modal knowledge sharing during training to improve inference performance, achieving a +3% precision gain over state-of-the-art methods.

Contribution

The paper proposes a novel mixture-of-experts framework with a modality classifier to facilitate cross-modal knowledge sharing during training, enhancing RGB-X tracker performance.
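
As a rough illustration of how a modality classifier could drive a mixture of experts, the sketch below uses the classifier's softmax probabilities as soft gates over per-modality experts. This is a minimal sketch under stated assumptions, not the authors' architecture: the linear classifier, the scaling experts, and every name here (`route`, `make_scaling_expert`, `classifier_weights`) are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classifier_logits(features, classifier_weights):
    """Hypothetical linear modality classifier: one weight row per modality."""
    return [sum(w * f for w, f in zip(row, features)) for row in classifier_weights]


def make_scaling_expert(scale):
    """Toy stand-in for a per-modality expert network (assumption for illustration)."""
    def expert(features):
        return [scale * f for f in features]
    return expert


def route(features, classifier_weights, experts):
    """Mix expert outputs, weighted by the classifier's modality probabilities.

    A sample the classifier cannot assign confidently to one modality gets
    gates spread over several experts, so those experts share it.
    """
    gates = softmax(classifier_logits(features, classifier_weights))
    outputs = [expert(features) for expert in experts]
    dim = len(outputs[0])
    mixed = [sum(g * out[i] for g, out in zip(gates, outputs)) for i in range(dim)]
    return gates, mixed


# Usage: three hypothetical experts (e.g. one per auxiliary modality).
experts = [make_scaling_expert(s) for s in (1.0, 2.0, 3.0)]
undecided = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # classifier has no preference
gates, mixed = route([1.0, 2.0], undecided, experts)
```

With an undecided classifier the gates are uniform and the output is the plain average of the experts; a confident classifier concentrates the weight on a single expert instead.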

Findings

01. Achieved +3% precision improvement over SOTA.

02. Effective knowledge transfer between modalities during training.

03. Beneficial even with limited paired multimodal data.

Abstract

Multimodal sensing has proven valuable for visual tracking, as each sensor type offers unique strengths in handling specific challenging scenes where object appearance varies. While a generalist model capable of leveraging all modalities would be ideal, its development is hindered by data sparsity: in practice, typically only one modality is available at a time. It is therefore crucial that knowledge gained from multimodal sensing, such as identifying relevant features and regions, is effectively shared, even when certain modalities are unavailable at inference. We start from a simple assumption: similar samples across different modalities have more knowledge to share than dissimilar ones. To implement this, we employ a "weak" classifier tasked with distinguishing between modalities. More specifically, if the classifier "fails" to accurately identify the…

Taxonomy

Topics: Industrial Vision Systems and Defect Detection