XTrack: Multimodal Training Boosts RGB-X Video Object Trackers
Yuedong Tan, Zongwei Wu, Yuqian Fu, Zhuyun Zhou, Guolei Sun, Eduard Zamfir, Chao Ma, Danda Pani Paudel, Luc Van Gool, Radu Timofte

TL;DR
XTrack introduces a multimodal training approach for RGB-X video object tracking that leverages cross-modal knowledge sharing during training to improve inference performance, achieving a +3% precision gain over state-of-the-art methods.
Contribution
The paper proposes a novel mixture-of-experts framework with a modality classifier to facilitate cross-modal knowledge sharing during training, enhancing RGB-X tracker performance.
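The idea of letting a modality classifier steer a mixture of experts can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the paper's implementation: the three-modality setup, the linear classifier weights, the offset-based experts, and every function name below are hypothetical and chosen only to show how soft classifier outputs could blend expert outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def modality_probs(features, classifier_weights):
    """Hypothetical weak linear classifier: one weight vector per modality."""
    logits = [sum(w * x for w, x in zip(weights, features))
              for weights in classifier_weights]
    return softmax(logits)


def make_expert(offset):
    """Toy expert that shifts every feature by a fixed offset (illustrative)."""
    def expert(features):
        return [x + offset for x in features]
    return expert


def mix_experts(features, probs, experts):
    """Blend expert outputs, weighting each by the classifier's probability."""
    outputs = [expert(features) for expert in experts]
    return [sum(p * out[i] for p, out in zip(probs, outputs))
            for i in range(len(features))]


# Assumed setup: three modalities, one toy expert per modality.
CLASSIFIER_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
EXPERTS = [make_expert(1.0), make_expert(2.0), make_expert(3.0)]

ambiguous = [0.0, 0.0]   # classifier cannot tell the modalities apart
distinct = [10.0, 0.0]   # classifier is confident about modality 0

ambiguous_probs = modality_probs(ambiguous, CLASSIFIER_WEIGHTS)
distinct_probs = modality_probs(distinct, CLASSIFIER_WEIGHTS)
ambiguous_mix = mix_experts(ambiguous, ambiguous_probs, EXPERTS)
distinct_mix = mix_experts(distinct, distinct_probs, EXPERTS)
```

In this toy setting, a sample the classifier cannot place draws equally on all experts, while a sample it identifies confidently is handled almost entirely by one expert. Whether XTrack routes samples in exactly this way is not stated in this summary.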
Findings
Achieved +3% precision improvement over SOTA.
Effective knowledge transfer between modalities during training.
Beneficial even with limited paired multimodal data.
Abstract
Multimodal sensing has proven valuable for visual tracking, as each sensor type offers unique strengths in handling a specific kind of challenging scene where object appearance varies. While a generalist model capable of leveraging all modalities would be ideal, its development is hindered by data sparsity: in practice, typically only one modality is available at a time. It is therefore crucial that knowledge gained from multimodal sensing, such as identifying relevant features and regions, is effectively shared even when certain modalities are unavailable at inference. We start from a simple assumption: similar samples across different modalities have more knowledge to share than dissimilar ones. To implement this, we employ a "weak" classifier tasked with distinguishing between modalities. More specifically, if the classifier "fails" to accurately identify the…
Taxonomy
Topics: Industrial Vision Systems and Defect Detection
