Learning Spatio-Temporal Transformer for Visual Tracking

Bin Yan; Houwen Peng; Jianlong Fu; Dong Wang; Huchuan Lu

arXiv:2103.17154 · cs.CV · April 1, 2021

TL;DR

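This paper introduces an end-to-end tracking architecture built on an encoder-decoder transformer. It models global spatio-temporal dependencies between the target and the search region and predicts the target's bounding box directly, reaching state-of-the-art results at real-time speed without postprocessing such as a cosine window or bounding box smoothing.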

Contribution

It proposes a transformer-based tracking method that models global spatio-temporal dependencies and predicts object locations directly, without proposals or predefined anchors, which simplifies the tracking pipeline.

Findings

01  Achieves state-of-the-art performance on five benchmarks.

02  Runs at real-time speed, 6x faster than Siam R-CNN.

03  Does not require postprocessing steps like bounding box smoothing.

Abstract

In this paper, we present a new tracking architecture with an encoder-decoder transformer as the key component. The encoder models the global spatio-temporal feature dependencies between target objects and search regions, while the decoder learns a query embedding to predict the spatial positions of the target objects. Our method casts object tracking as a direct bounding box prediction problem, without using any proposals or predefined anchors. With the encoder-decoder transformer, the prediction of objects just uses a simple fully-convolutional network, which estimates the corners of objects directly. The whole method is end-to-end, does not need any postprocessing steps such as cosine window and bounding box smoothing, thus largely simplifying existing tracking pipelines. The proposed tracker achieves state-of-the-art performance on five challenging short-term and long-term…
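The abstract says the prediction head estimates the box corners directly, but it does not spell out how a corner is read off the network output. One plausible reading, shown here as a minimal sketch rather than the authors' implementation, is to treat each corner's score map as a probability distribution and take its expected coordinate (a soft-argmax). The function names, the plain-list representation of the score maps, and the use of a softmax are all assumptions made for illustration.

```python
import math


def softmax_2d(scores):
    """Normalise a 2-D score map (list of rows) into probabilities summing to 1."""
    peak = max(s for row in scores for s in row)  # subtract max for numerical stability
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def expected_corner(scores):
    """Return the expected (x, y) position of a corner under softmax(scores).

    x is the column index and y is the row index, both in feature-map units.
    This soft-argmax is an assumed decoding rule, not taken from the abstract.
    """
    prob = softmax_2d(scores)
    x = sum(p * col for row in prob for col, p in enumerate(row))
    y = sum(p * r for r, row in enumerate(prob) for p in row)
    return x, y


def predict_box(top_left_scores, bottom_right_scores):
    """Combine two hypothetical corner score maps into a box (x1, y1, x2, y2)."""
    x1, y1 = expected_corner(top_left_scores)
    x2, y2 = expected_corner(bottom_right_scores)
    return x1, y1, x2, y2
```

With score maps sharply peaked at column 2, row 1 and at column 3, row 3, `predict_box` returns a box close to (2, 1, 3, 3) in feature-map coordinates. Because the expectation is a weighted sum rather than a hard argmax, a decoding rule of this kind is differentiable, which fits the end-to-end training the abstract describes.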

Code & Models

Repositories

researchmm/Stark (PyTorch, official)

Taxonomy

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Dropout · Byte Pair Encoding · Adam · Dense Connections · Softmax