MixFormerV2: Efficient Fully Transformer Tracking

Yutao Cui; Tianhui Song; Gangshan Wu; Limin Wang

arXiv:2305.15896 · cs.CV · February 8, 2024 · 28 cites

TL;DR

MixFormerV2 is a fully transformer-based tracker that gains efficiency from special prediction tokens and distillation-based model reduction, reaching 70.6% AUC on LaSOT at 165 FPS on GPU and real-time speed on CPU.

Contribution

The paper proposes a novel fully transformer tracking framework without dense convolutions, utilizing prediction tokens and distillation techniques for improved efficiency and performance.

Findings

01

MixFormerV2-B achieves 70.6% AUC on LaSOT at 165 FPS on GPU.

02

MixFormerV2-S surpasses FEAR-L by 2.7% AUC on LaSOT while running in real time on CPU.

03

Introduces a distillation-based model reduction paradigm.

Abstract

Transformer-based trackers have achieved strong accuracy on the standard benchmarks. However, their efficiency remains an obstacle to practical deployment on both GPU and CPU platforms. In this paper, to overcome this issue, we propose a fully transformer tracking framework, coined MixFormerV2, without any dense convolutional operation or complex score prediction module. Our key design is to introduce four special prediction tokens and concatenate them with the tokens from the target template and search areas. Then, we apply the unified transformer backbone to this mixed token sequence. These prediction tokens are able to capture the complex correlation between the target template and the search area via mixed attention. Based on them, we can easily predict the tracking box and estimate its confidence score through simple MLP heads. To further improve the efficiency of MixFormerV2, we…
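
The token-mixing idea in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the embedding width `DIM`, the token counts, the identity-projection attention, the one-token-per-box-coordinate mapping, and the mean-pooled score head are all assumptions made for illustration, and single linear layers stand in for the MLP heads.

```python
import math
import random

random.seed(0)
DIM = 8  # hypothetical embedding width, chosen only for this sketch


def rand_tokens(n, dim=DIM):
    """Random stand-ins for learned or patch-embedded tokens."""
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head attention over the whole sequence.

    Identity query/key/value projections are an assumption of this sketch;
    a real backbone would use learned projections and many layers.
    """
    scale = math.sqrt(len(tokens[0]))
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens))
                    for d in range(len(q))])
    return out


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(vec, weights):
    """One linear layer with a scalar output (stand-in for an MLP head)."""
    return sum(w * x for w, x in zip(weights, vec))


def track_step(template, search, pred, box_w, score_w):
    # Concatenate prediction, template, and search tokens into one sequence.
    mixed = self_attention(pred + template + search)
    pred_out = mixed[:len(pred)]  # only the prediction tokens feed the heads
    # Assumption: each prediction token yields one normalized box coordinate.
    box = [sigmoid(linear(tok, box_w)) for tok in pred_out]
    # Assumption: the confidence score is read from the mean prediction token.
    pooled = [sum(col) / len(pred_out) for col in zip(*pred_out)]
    score = sigmoid(linear(pooled, score_w))
    return box, score


template = rand_tokens(4)   # e.g. a 2x2 grid of template patches
search = rand_tokens(9)     # e.g. a 3x3 grid of search patches
pred = rand_tokens(4)       # the four special prediction tokens
box_w = [random.gauss(0.0, 0.1) for _ in range(DIM)]
score_w = [random.gauss(0.0, 0.1) for _ in range(DIM)]

box, score = track_step(template, search, pred, box_w, score_w)
print("box:", box)
print("score:", score)
```

Because the prediction tokens attend to every template and search token inside the shared attention step, the heads only need to read those four output tokens, which is what lets the design drop a dense convolutional prediction head.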

Code & Models

Repositories

mcg-nju/mixformerv2
PyTorch · Official

Videos

MixFormerV2: Efficient Fully Transformer Tracking · SlidesLive

Taxonomy

Topics: Video Surveillance and Tracking Methods · Advanced Neural Network Applications · Human Pose and Action Recognition

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings