Learning Tracking Representations via Dual-Branch Fully Transformer Networks

Fei Xie; Chunyu Wang; Guangting Wang; Wankou Yang; Wenjun Zeng

arXiv:2112.02571·cs.CV·December 7, 2021

TL;DR

This paper introduces a purely Transformer-based dual-branch network for object tracking. Its features are learned through matching, and it reaches competitive accuracy at real-time speed without a CNN backbone.

Contribution

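The paper proposes a fully Transformer-based dual-branch network for tracking. Its features are learned from matching and used for matching, so they are aligned with the tracking task; the method reaches results better than or comparable to CNN-plus-Transformer trackers while running in real time.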

Findings

1. Outperforms state-of-the-art methods on the GOT-10k and VOT2020 benchmarks.

2. Achieves real-time inference at about 40 fps on a single GPU.

3. Features are learned directly from matching, which aligns them with the tracking task.

Abstract

We present a Siamese-like dual-branch network based solely on Transformers for tracking. Given a template and a search image, we divide them into non-overlapping patches and extract a feature vector for each patch based on its matching results with other patches within an attention window. For each token, we estimate whether it contains the target object and the corresponding size. The advantage of the approach is that the features are learned from matching and, ultimately, for matching, so they are aligned with the object tracking task. The method achieves results better than or comparable to those of the best-performing methods, which first use a CNN to extract features and then use a Transformer to fuse them. It outperforms the state-of-the-art methods on the GOT-10k and VOT2020 benchmarks. In addition, the method achieves real-time inference speed (about 40 fps) on one GPU. The code and models will…
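
The patch-and-match idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (split_into_patches, attend), the use of raw pixel values as token features, the absence of learned projections, and the way the attention window is passed in as a plain list of tokens are all assumptions made for illustration. The abstract does not specify how windows are formed or how the per-token target and size predictions are computed, so those parts are left out.

```python
import math


def split_into_patches(image, patch):
    """Split an H x W grid (list of rows) into non-overlapping patch x patch tokens.

    Each token is the flattened list of values inside one patch.
    Hypothetical helper: the paper's actual patch embedding is not shown here.
    """
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, window):
    """Scaled dot-product attention of one query token over a window of tokens.

    Assumption: the window tokens serve as both keys and values, with no
    learned projections. Returns the matched feature and the attention weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in window]
    weights = softmax(scores)
    feature = [sum(w * key[i] for w, key in zip(weights, window))
               for i in range(len(query))]
    return feature, weights


if __name__ == "__main__":
    # Toy data: a 4x4 "template" and an 8x8 "search image", both with 2x2 patches.
    template = [[float((r + c) % 3) for c in range(4)] for r in range(4)]
    search = [[float((r * c) % 3) for c in range(8)] for r in range(8)]
    template_tokens = split_into_patches(template, 2)
    search_tokens = split_into_patches(search, 2)
    # Each search token gets a feature built from its matching with template tokens.
    features = [attend(tok, template_tokens)[0] for tok in search_tokens]
    print(len(template_tokens), len(search_tokens), len(features[0]))
```

In this sketch each search token's feature is a weighted sum of template tokens, with weights given by how well the token matches each of them, which is one simple reading of "features learned from matching".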

Code & Models

Repositories

phiphiphi31/dualtfr (Official)

Taxonomy

Topics: Video Surveillance and Tracking Methods · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Absolute Position Encodings · Softmax · Residual Connection · Adam · Dropout · Position-Wise Feed-Forward Layer · Layer Normalization