SparseTT: Visual Tracking with Sparse Transformers
Zhihong Fu, Zehua Fu, Qingjie Liu, Wenrui Cai, Yunhong Wang

TL;DR
SparseTT introduces a sparse attention mechanism and a double-head predictor to enhance visual tracking accuracy, outperforming state-of-the-art methods while reducing training time and maintaining real-time speed.
Contribution
The paper proposes a novel sparse attention mechanism and a double-head predictor for improved accuracy and efficiency in visual tracking with Transformers.
Findings
Outperforms state-of-the-art on multiple benchmarks
Runs at 40 FPS with reduced training time
Significantly improves tracking accuracy
Abstract
Transformers have been successfully applied to the visual tracking task and have significantly improved tracking performance. The self-attention mechanism, designed to model long-range dependencies, is the key to the success of Transformers. However, self-attention does not focus on the most relevant information in the search regions, so it is easily distracted by background. In this paper, we relieve this issue with a sparse attention mechanism that focuses on the most relevant information in the search regions, which enables much more accurate tracking. Furthermore, we introduce a double-head predictor to boost the accuracy of foreground-background classification and regression of target bounding boxes, which further improves the tracking performance. Extensive experiments show that, without bells and whistles, our method significantly outperforms the state-of-the-art approaches on LaSOT,…
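The abstract describes sparse attention only at a high level, so the following is a minimal illustrative sketch rather than the authors' implementation. It assumes one common way of sparsifying attention: for each query, keep only the `top_k` largest similarity scores and apply softmax over those, so that low-similarity (often background) positions receive zero weight. The function name `topk_sparse_attention`, the parameter `top_k`, and the single-head, unbatched shapes are all assumptions made for illustration.

```python
import numpy as np

def topk_sparse_attention(q, k, v, top_k):
    """Single-head attention that keeps only the top_k scores per query.

    Hypothetical sketch: q is (n_q, d), k is (n_k, d), v is (n_k, d_v).
    Returns the attended values (n_q, d_v) and the weights (n_q, n_k).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)              # scaled dot-product scores, (n_q, n_k)
    top_k = min(top_k, scores.shape[-1])       # cannot keep more keys than exist

    # Indices of the top_k largest scores in each row (order within the set is arbitrary).
    idx = np.argpartition(scores, -top_k, axis=-1)[:, -top_k:]

    # Additive mask: 0 for kept positions, -inf for everything else.
    mask = np.full_like(scores, -np.inf)
    np.put_along_axis(mask, idx, 0.0, axis=-1)

    # Numerically stable softmax over the kept positions only.
    masked = scores + mask
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)                   # exp(-inf) = 0 for masked positions
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Usage: 4 queries attend to 10 keys, each keeping only its 3 best matches.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(10, 8))
v = rng.normal(size=(10, 3))
out, w = topk_sparse_attention(q, k, v, top_k=3)
```

With `top_k` equal to the number of keys, this reduces to ordinary dense softmax attention, which makes the sparsity level easy to compare against a dense baseline.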
Taxonomy
Topics: Video Surveillance and Tracking Methods · Impact of Light on Environment and Health · Air Quality Monitoring and Forecasting
