Visual Object Tracking across Diverse Data Modalities: A Review

Mengmeng Wang; Teli Ma; Shuo Xin; Xiaojun Hou; Jiazheng Xing; Guang Dai; Jingdong Wang; and Yong Liu

arXiv:2412.09991·cs.CV·December 16, 2024

TL;DR

This survey reviews recent advances in visual object tracking across various data modalities, emphasizing deep learning methods for single-modal and multi-modal scenarios, and provides benchmark comparisons and future insights.

Contribution

It offers a comprehensive categorization and analysis of single-modal and multi-modal VOT frameworks, highlighting recent progress and future directions in the field.

Findings

01

Benchmark results show varying performance across modalities.

02

Deep learning methods dominate recent VOT approaches.

03

Multi-modal VOT is increasingly effective in diverse environments.

Abstract

Visual Object Tracking (VOT) is an attractive and significant research area in computer vision that aims to recognize and track specific targets in video sequences, where the target objects are arbitrary and class-agnostic. VOT can be applied in various scenarios, processing data of diverse modalities such as RGB, thermal infrared, and point cloud. Moreover, since no single sensor can handle all dynamic and varying environments, multi-modal VOT is also investigated. This paper presents a comprehensive survey of recent progress in both single-modal and multi-modal VOT, especially deep learning methods. Specifically, we first review three types of mainstream single-modal VOT: RGB, thermal infrared, and point cloud tracking. In particular, we summarize four widely used single-modal frameworks, abstracting their schemas and categorizing the existing…
