Towards Universal Modal Tracking with Online Dense Temporal Token Learning

Yaozong Zheng; Bineng Zhong; Qihua Liang; Shengping Zhang; Guorong Li; Xianxian Li; Rongrong Ji

arXiv:2507.20177·cs.CV·July 30, 2025

TL;DR

This paper introduces {\boldsymbol{\texttt{modaltracker}}}, a universal, online dense temporal token learning model for multi-modal video tracking that supports various modalities and achieves state-of-the-art results.

Contribution

The paper presents a unified model architecture with online dense temporal token association and modality-scalable gated perceivers for multi-modal video tracking.

Findings

01

Achieves new state-of-the-art performance on multiple benchmarks.

02

Supports various modalities with a single model architecture.

03

Reduces training complexity through a one-shot training scheme.

Abstract

We propose a universal video-level modality-awareness tracking model with online dense temporal token learning (called {\modaltracker}). It is designed to support various tracking tasks, including RGB, RGB+Thermal, RGB+Depth, and RGB+Event, utilizing the same model architecture and parameters. Specifically, our model is designed with three core goals: \textbf{Video-level Sampling}. We expand the model's inputs to the video-sequence level, aiming to see a richer video context from a near-global perspective. \textbf{Video-level Association}. Furthermore, we introduce two simple yet effective online dense temporal token association mechanisms to propagate the appearance and motion trajectory information of the target in a video-stream manner. \textbf{Modality Scalable}. We propose two novel gated perceivers that adaptively learn cross-modal representations via a gated attention mechanism, and…
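
To make the idea of gated cross-modal attention more concrete, here is a minimal sketch in plain Python. It is not the authors' gated perceiver: the abstract only states that a gated attention mechanism is used, so the tanh-gated residual form, the single-head attention, and every name below (`cross_attention`, `gated_fusion`, `gate`) are assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention.

    Assumption: the auxiliary tokens serve as both keys and values,
    and no learned projections are applied.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in keys_values])
        attended.append(
            [sum(w * kv[i] for w, kv in zip(weights, keys_values)) for i in range(dim)]
        )
    return attended


def gated_fusion(rgb_tokens, aux_tokens, gate):
    """Hypothetical gated fusion: rgb + tanh(gate) * attention(rgb -> aux).

    `gate` stands in for a learnable scalar. With gate = 0 the auxiliary
    modality (thermal, depth, or event) contributes nothing and the RGB
    tokens pass through unchanged.
    """
    attended = cross_attention(rgb_tokens, aux_tokens)
    g = math.tanh(gate)
    return [
        [r + g * a for r, a in zip(rgb_tok, att_tok)]
        for rgb_tok, att_tok in zip(rgb_tokens, attended)
    ]


# Usage: one RGB token attends to one auxiliary-modality token.
rgb = [[1.0, 0.0]]
aux = [[0.0, 2.0]]
closed = gated_fusion(rgb, aux, gate=0.0)   # gate closed: RGB only
opened = gated_fusion(rgb, aux, gate=20.0)  # gate nearly fully open
```

Under these assumptions, a gate near zero lets the model fall back to RGB-only behaviour, which is one plausible way for a single set of parameters to serve several modality combinations.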
