TrTr: Visual Tracking with Transformer

Moju Zhao; Kei Okada; Masayuki Inaba

arXiv:2105.03817 · cs.CV · May 11, 2021 · 73 citations

TL;DR

This paper introduces TrTr, a visual tracking method that uses a Transformer encoder-decoder architecture to capture global contextual information, outperforming traditional correlation-based trackers on multiple benchmarks.

Contribution

The paper proposes a novel Transformer-based tracker architecture that models global dependencies for improved visual tracking performance.

Findings

01 Outperforms state-of-the-art on multiple benchmarks

02 Effective use of Transformer for global context modeling

03 Competitive accuracy and robustness in tracking tasks

Abstract

Template-based discriminative trackers are currently the dominant tracking methods due to their robustness and accuracy, and Siamese-network-based methods, which depend on a cross-correlation operation between features extracted from the template and search images, show state-of-the-art tracking performance. However, a general cross-correlation operation can only capture the relationship between local patches in the two feature maps. In this paper, we propose a novel tracker network based on a powerful attention mechanism, the Transformer encoder-decoder architecture, to gain global and rich contextual interdependencies. In this new architecture, features of the template image are processed by a self-attention module in the encoder part to learn strong context information, which is then sent to the decoder part to compute cross-attention with the search image features processed by another…
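The attention flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses single-head scaled dot-product attention in plain Python, with no learned projections, positional encodings, or feed-forward layers. The toy feature vectors and the names `attention`, `template`, and `search` are hypothetical. It also assumes that the search features act as queries and the encoded template features act as keys and values, and it passes the search features to the decoder unprocessed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors.

    Returns the attended outputs and the attention weights. Every query
    is compared with every key, so each output mixes information from
    all positions rather than from a local patch.
    """
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        all_weights.append(weights)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs, all_weights


# Hypothetical toy features: 2 template tokens and 3 search tokens, dimension 2.
template = [[1.0, 0.0], [0.0, 1.0]]
search = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

# Encoder side: self-attention over the template features.
template_ctx, _ = attention(template, template, template)

# Decoder side (assumed roles): search tokens query the encoded template.
fused, weights = attention(search, template_ctx, template_ctx)

for token, row in zip(search, weights):
    print(token, "attends to template tokens with weights", row)
```

Each row of `weights` sums to 1 and covers every template token, which is the global pairing that the abstract contrasts with the local patch matching of cross-correlation.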

Code & Models

Repositories

tongtybj/TrTr (PyTorch, official)

Taxonomy

Topics: Video Surveillance and Tracking Methods · UAV Applications and Optimization · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout · Softmax · Layer Normalization · Label Smoothing · Byte Pair Encoding