Serial Over Parallel: Learning Continual Unification for Multi-Modal Visual Object Tracking and Benchmarking

Zhangyong Tang; Tianyang Xu; Xuefeng Zhu; Chunyang Cheng; Tao Zhou; Xiaojun Wu; Josef Kittler

arXiv:2508.10655·cs.CV·August 15, 2025

TL;DR

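This paper introduces UniBench300, a unified benchmark for multi-modal visual object tracking that lets a tracker be evaluated in one inference pass instead of three, and reformulates the unification process as a continual learning task in which tasks are learned serially rather than in a single mixed training procedure.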

Contribution

It presents a new unified benchmark, UniBench300, that removes the inconsistency between joint training and evaluation on separate benchmarks, and reformulates multi-modal tracking unification as a continual learning problem.

Findings

01. UniBench300 reduces inference passes from three to one and cuts inference time by 27%

02. Continual learning improves unification stability

03. Modality discrepancies affect degradation levels

Abstract

Unifying multiple multi-modal visual object tracking (MMVOT) tasks draws increasing attention due to the complementary nature of different modalities in building robust tracking systems. Existing practices mix all data sensor types in a single training procedure, structuring a parallel paradigm from the data-centric perspective and aiming for a global optimum on the joint distribution of the involved tasks. However, the absence of a unified benchmark where all types of data coexist forces evaluations on separated benchmarks, causing inconsistency between training and testing, thus leading to performance degradation. To address these issues, this work advances in two aspects: (1) A unified benchmark, coined as UniBench300, is introduced to bridge the inconsistency by incorporating multiple task data, reducing inference passes from three to one and cutting time…
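The difference between the parallel paradigm described above and a serial (continual) one can be illustrated at the level of data scheduling. The following is a minimal sketch, not the authors' training code: the task names `task_a`, `task_b`, `task_c`, the sample counts, and the helper functions are all hypothetical placeholders chosen for illustration.

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[str, int]  # (task name, sample index); both are illustrative


def parallel_schedule(tasks: Dict[str, int], seed: int = 0) -> List[Sample]:
    """Parallel paradigm: mix samples of all tasks into one shuffled stream."""
    stream = [(name, i) for name, n in tasks.items() for i in range(n)]
    random.Random(seed).shuffle(stream)
    return stream


def serial_schedule(tasks: Dict[str, int], seed: int = 0) -> List[Sample]:
    """Serial paradigm: visit tasks one after another, shuffling within a task."""
    rng = random.Random(seed)
    stream: List[Sample] = []
    for name, n in tasks.items():
        block = [(name, i) for i in range(n)]
        rng.shuffle(block)
        stream.extend(block)
    return stream


def task_switches(stream: List[Sample]) -> int:
    """Count how often consecutive samples come from different tasks."""
    return sum(1 for a, b in zip(stream, stream[1:]) if a[0] != b[0])


# Hypothetical task sizes; real MMVOT datasets are far larger.
tasks = {"task_a": 4, "task_b": 4, "task_c": 4}
mixed = parallel_schedule(tasks)
ordered = serial_schedule(tasks)
```

Both schedules contain exactly the same samples; only the order differs. The serial stream changes task only at block boundaries, which is the setting in which continual learning concerns such as forgetting earlier tasks arise.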
