Unified Multimodal Visual Tracking with Dual Mixture-of-Experts

Lingyi Hong; Jinglun Li; Xinyu Zhou; Kaixun Jiang; Pinxue Guo; Zhaoyu Chen; Runze Li; Xingdong Sheng; Wenqiang Zhang

arXiv:2605.03716 · cs.CV · May 6, 2026

TL;DR

OneTrackerV2 is a unified multimodal tracking framework that employs dual mixture-of-experts to achieve state-of-the-art performance across various modalities and benchmarks with high efficiency.

Contribution

It introduces a novel unified end-to-end training framework with Meta Merger and Dual MoE, enabling flexible modality fusion and improved robustness in multimodal tracking.

Findings

1. Achieves state-of-the-art results on five tracking tasks and 12 benchmarks.

2. Maintains strong performance even after model compression.

3. Demonstrates robustness under modality-missing scenarios.

Abstract

Multimodal visual object tracking can be divided into several kinds of tasks (e.g., RGB and RGB+X tracking), based on the input modality. Existing methods often train separate models for each modality or rely on pretrained models to adapt to new modalities, which limits efficiency, scalability, and usability. Thus, we introduce OneTrackerV2, a unified multi-modal tracking framework that enables end-to-end training for any modality. We propose Meta Merger to embed multi-modal information into a unified space, allowing flexible modality fusion and robustness. We further introduce Dual Mixture-of-Experts (DMoE): T-MoE models spatio-temporal relations for tracking, while M-MoE embeds multi-modal knowledge, disentangling cross-modal dependencies and reducing feature conflicts. With a shared architecture, unified parameters, and a single end-to-end training, OneTrackerV2 achieves…
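For readers unfamiliar with mixture-of-experts layers, the sketch below shows the generic idea of top-k gated routing that such layers typically build on. It is not the paper's T-MoE or M-MoE: the class name `ToyMoE`, the linear experts, the softmax router, and all sizes are assumptions chosen for illustration, since the abstract does not describe the actual expert or gating design.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class ToyMoE:
    """Hypothetical top-k gated mixture of linear experts (not the paper's DMoE)."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Router: one weight row per expert, producing one logit per expert.
        self.router = [[rng.gauss(0, 0.1) for _ in range(dim)]
                       for _ in range(num_experts)]
        # Each expert is a dim x dim linear map (an assumption for brevity).
        self.experts = [
            [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]
            for _ in range(num_experts)
        ]

    def route(self, token):
        """Return (expert_index, weight) pairs for the top-k experts."""
        logits = matvec(self.router, token)
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = order[: self.top_k]
        # Renormalise the gate weights over the selected experts only.
        weights = softmax([logits[i] for i in chosen])
        return list(zip(chosen, weights))

    def forward(self, token):
        """Weighted sum of the selected experts' outputs for one token."""
        out = [0.0] * len(token)
        for idx, weight in self.route(token):
            expert_out = matvec(self.experts[idx], token)
            out = [o + weight * e for o, e in zip(out, expert_out)]
        return out


moe = ToyMoE(dim=4, num_experts=3, top_k=2)
token = [0.5, -1.0, 0.25, 2.0]
routing = moe.route(token)
output = moe.forward(token)
```

Only the selected experts run for a given token, which is why sparse routing of this kind keeps compute roughly constant as experts are added. How OneTrackerV2 assigns experts to temporal versus modality-specific roles is not specified in the text above, so the sketch makes no attempt to model it.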
