Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking
Ning Wang, Wengang Zhou, Jie Wang, Houqiang Li

TL;DR
This paper introduces a transformer-based framework that leverages temporal context across video frames to significantly improve the robustness and accuracy of visual object tracking, outperforming existing methods.
Contribution
The paper splits the transformer's encoder and decoder into two parallel branches inside a Siamese-like tracking pipeline. The encoder reinforces target template features through attention, and the decoder propagates tracking cues from previous templates to the current frame.
Findings
Outperforms prior top-performing trackers on benchmark datasets, setting new state-of-the-art results.
The framework is trained end to end with the transformer integrated into the tracking pipeline.
Abstract
In video object tracking, there exist rich temporal contexts among successive frames, which have been largely overlooked in existing trackers. In this work, we bridge the individual video frames and explore the temporal contexts across them via a transformer architecture for robust object tracking. Different from classic usage of the transformer in natural language processing tasks, we separate its encoder and decoder into two parallel branches and carefully design them within the Siamese-like tracking pipelines. The transformer encoder promotes the target templates via attention-based feature reinforcement, which benefits the high-quality tracking model generation. The transformer decoder propagates the tracking cues from previous templates to the current frame, which facilitates the object searching process. Our transformer-assisted tracking framework is neat and trained in an…
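The two-branch idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain single-head scaled dot-product attention with residual connections, omits all learned projections and normalization, and uses hypothetical function names (`reinforce_templates`, `propagate_to_search`) and made-up feature sizes. It only shows the data flow: template features attend to each other (encoder-style reinforcement), and search-frame features then attend to the reinforced templates (decoder-style propagation).

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    # Scaled dot-product attention: each query row becomes a
    # weighted average of the value rows.
    scores = query @ key.T / np.sqrt(query.shape[-1])
    return softmax(scores, axis=-1) @ value

def reinforce_templates(templates):
    # Encoder-style step (assumed form): self-attention over all
    # template positions, added back through a residual connection.
    return templates + attention(templates, templates, templates)

def propagate_to_search(search, templates):
    # Decoder-style step (assumed form): search-frame positions
    # attend to template positions via cross-attention.
    return search + attention(search, templates, templates)

rng = np.random.default_rng(0)
# Hypothetical sizes: 3 past templates of 16 positions each,
# 64 search positions, 32 feature channels.
templates = rng.standard_normal((3 * 16, 32))
search = rng.standard_normal((64, 32))

encoded = reinforce_templates(templates)
decoded = propagate_to_search(search, encoded)
print(encoded.shape, decoded.shape)
```

Both steps preserve the shape of their first argument, so in a sketch like this the outputs could be passed to whatever tracking head consumed the original features.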
Taxonomy
Topics: Video Surveillance and Tracking Methods · IoT-based Smart Home Systems · Human Pose and Action Recognition
