Awesome Multi-modal Object Tracking
Chunhui Zhang, Li Liu, Hao Wen, Xi Zhou, Yanfeng Wang

TL;DR
This paper provides a comprehensive review of multi-modal object tracking, categorizing existing tasks, analyzing datasets and algorithms, and highlighting recent advances and challenges in integrating multiple data modalities.
Contribution
It offers a systematic categorization and analysis of MMOT tasks, datasets, and algorithms, and summarizes recent progress and future directions in the field.
Findings
Existing MMOT algorithms mainly focus on two modalities.
Recent efforts aim to learn a unified tracking model that works for any modality.
Large-scale benchmarks providing more than two modalities have been established (e.g., WebUAV-3M and UniMod1K).
Abstract
Multi-modal object tracking (MMOT) is an emerging field that combines data from various modalities, e.g., vision (RGB), depth, thermal infrared, event, language and audio, to estimate the state of an arbitrary object in a video sequence. It is of great significance for many applications such as autonomous driving and intelligent surveillance. In recent years, MMOT has received more and more attention. However, existing MMOT algorithms mainly focus on two modalities (e.g., RGB+depth, RGB+thermal infrared, and RGB+language). To leverage more modalities, some recent efforts have been made to learn a unified visual object tracking model for any modality. Additionally, some large-scale multi-modal tracking benchmarks have been established by simultaneously providing more than two modalities, such as vision-language-audio (e.g., WebUAV-3M) and vision-depth-language (e.g., UniMod1K). To track the…
Taxonomy
Topics: Video Surveillance and Tracking Methods · Infrared Target Detection Methodologies
Methods: Focus
