Learning Spatial-Frequency Transformer for Visual Object Tracking
Chuanming Tang, Xiao Wang, Yuanchao Bai, Zhe Wu, Jianlin Zhang, Yongmei Huang

TL;DR
This paper introduces a novel Spatial-Frequency Transformer for visual object tracking that models spatial priors and high-frequency features, leading to improved tracking accuracy in various scenarios.
Contribution
It proposes a unified Spatial-Frequency Transformer with Gaussian spatial prior and high-frequency emphasis attention, integrated into a Siamese tracking framework for enhanced performance.
Findings
Effective on both short-term and long-term tracking benchmarks.
Protects high-frequency features through all-pass filtering.
Outperforms existing methods in accuracy and robustness.
Abstract
Recent trackers adopt the Transformer to combine with or replace the widely used ResNet as their backbone network. Although these trackers work well in regular scenarios, they simply flatten the 2D features into a sequence to better match the Transformer. We believe these operations ignore the spatial prior of the target object, which may lead to sub-optimal results. In addition, many works demonstrate that self-attention is actually a low-pass filter, independent of the input features or keys/queries. That is to say, it may suppress the high-frequency component of the input features and preserve or even amplify the low-frequency information. To handle these issues, we propose a unified Spatial-Frequency Transformer that models the Gaussian spatial Prior and High-frequency emphasis Attention (GPHA) simultaneously. To be specific, Gaussian spatial prior is…
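The low-pass claim in the abstract can be illustrated with a toy example. The sketch below is not the paper's GPHA module; it only stacks plain softmax self-attention on random tokens and tracks how far the tokens differ from one another. The identity query/key/value projections, the random input, and the helper names (`attention_step`, `channel_spread`) are assumptions made for illustration. Because every softmax row is a set of positive weights summing to one, each output token is a convex combination of the input tokens, so the per-channel spread between tokens cannot grow.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_step(tokens):
    # Plain self-attention with identity Q/K/V projections (an assumption,
    # chosen to isolate the effect of the softmax weighting itself).
    d = tokens.shape[1]
    weights = softmax(tokens @ tokens.T / np.sqrt(d))  # rows sum to 1
    return weights @ tokens

def channel_spread(tokens):
    # Largest per-channel gap between any two tokens: a crude proxy for
    # how much non-constant (non-DC) content remains across the sequence.
    return float((tokens.max(axis=0) - tokens.min(axis=0)).max())

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))  # 16 tokens, 8 channels (hypothetical sizes)

spreads = [channel_spread(tokens)]
for _ in range(6):
    tokens = attention_step(tokens)
    spreads.append(channel_spread(tokens))

for layer, s in enumerate(spreads):
    print(f"after {layer} layers: spread = {s:.4f}")
```

In this toy setting the spread shrinks with depth, meaning the token features drift toward a common value. That loss of token-to-token variation is the kind of high-frequency suppression the abstract attributes to stacked self-attention.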
Taxonomy
Topics: Video Surveillance and Tracking Methods · Advanced Chemical Sensor Technologies · Infrared Target Detection Methodologies
Methods: Attention Is All You Need · Linear Layer · 1x1 Convolution · Batch Normalization · Dense Connections · Label Smoothing · Position-Wise Feed-Forward Layer · Residual Connection · Dropout
