MixFormerV2: Efficient Fully Transformer Tracking
Yutao Cui, Tianhui Song, Gangshan Wu, Limin Wang

TL;DR
MixFormerV2 is a fully transformer-based tracking framework that gains efficiency from four special prediction tokens and distillation-based model reduction, reaching 70.6% AUC on LaSOT at 165 FPS on GPU.
Contribution
The paper proposes a novel fully transformer tracking framework without dense convolutions, utilizing prediction tokens and distillation techniques for improved efficiency and performance.
Findings
Achieves 70.6% AUC on LaSOT with 165 FPS on GPU.
Surpasses FEAR-L by 2.7% AUC on LaSOT while running in real time on CPU.
Introduces a distillation-based model reduction paradigm.
Abstract
Transformer-based trackers have achieved strong accuracy on the standard benchmarks. However, their efficiency remains an obstacle to practical deployment on both GPU and CPU platforms. In this paper, to overcome this issue, we propose a fully transformer tracking framework, coined as MixFormerV2, without any dense convolutional operation or complex score prediction module. Our key design is to introduce four special prediction tokens and concatenate them with the tokens from the target template and search areas. Then, we apply the unified transformer backbone on this mixed token sequence. These prediction tokens are able to capture the complex correlation between target template and search area via mixed attentions. Based on them, we can easily predict the tracking box and estimate its confidence score through simple MLP heads. To further improve the efficiency of MixFormerV2, we…
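The token-mixing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single attention layer with one head, random weights, a linear box head that maps each prediction token to one normalized coordinate, and a confidence head on the averaged prediction tokens. All names (`track_step`, `mixed_attention`, `params`, `w_box`, `w_score`) and sizes are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def random_matrix(rows, cols, rng, scale):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def mixed_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over the whole mixed sequence, so every
    token (prediction, template, search) attends to every other token."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    dim = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, v), weights


def track_step(template_tokens, search_tokens, params):
    """Concatenate prediction, template, and search tokens, run attention,
    then read the box and confidence off the prediction tokens only."""
    pred = params["prediction_tokens"]
    mixed = pred + template_tokens + search_tokens
    out, weights = mixed_attention(mixed, params["w_q"], params["w_k"], params["w_v"])
    pred_out = out[: len(pred)]
    # Assumption: one normalized box coordinate per prediction token.
    box = [sigmoid(sum(x * w for x, w in zip(tok, params["w_box"]))) for tok in pred_out]
    # Assumption: confidence comes from the mean of the prediction tokens.
    mean_tok = [sum(col) / len(pred_out) for col in zip(*pred_out)]
    score = sigmoid(sum(x * w for x, w in zip(mean_tok, params["w_score"])))
    return box, score, weights


rng = random.Random(0)
DIM = 8  # hypothetical embedding size
params = {
    "prediction_tokens": random_matrix(4, DIM, rng, 0.1),  # four special tokens
    "w_q": random_matrix(DIM, DIM, rng, 0.1),
    "w_k": random_matrix(DIM, DIM, rng, 0.1),
    "w_v": random_matrix(DIM, DIM, rng, 0.1),
    "w_box": [rng.gauss(0.0, 0.1) for _ in range(DIM)],
    "w_score": [rng.gauss(0.0, 0.1) for _ in range(DIM)],
}
template_tokens = random_matrix(4, DIM, rng, 1.0)  # e.g. a 2x2 template patch grid
search_tokens = random_matrix(9, DIM, rng, 1.0)    # e.g. a 3x3 search patch grid

box, score, weights = track_step(template_tokens, search_tokens, params)
print("box:", [round(b, 3) for b in box], "score:", round(score, 3))
```

With random weights the printed box and score are meaningless. The sketch only shows the data flow: the box and confidence are computed from the four prediction-token outputs alone, with no convolutional head over the search-region feature map.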
Taxonomy
Topics: Video Surveillance and Tracking Methods · Advanced Neural Network Applications · Human Pose and Action Recognition
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
