Towards Real-World Visual Tracking with Temporal Contexts
Ziang Cao, Ziyuan Huang, Liang Pan, Shiwei Zhang, Ziwei Liu, Changhong Fu

TL;DR
This paper introduces TCTrack++, a visual tracking framework that exploits temporal contexts through attention-based convolution and adaptive transformers, with the aim of improving tracking under real-world conditions.
Contribution
It proposes a two-level framework that exploits temporal contexts at the level of features and of similarity maps, using attention-based convolution and adaptive transformers to address real-world tracking challenges.
Findings
Outperforms state-of-the-art trackers on 8 benchmarks.
Demonstrates robustness in real-world conditions.
Enhances tracking accuracy with temporal context integration.
Abstract
Visual tracking has made significant improvements in the past few decades. Most existing state-of-the-art trackers 1) merely aim for performance in ideal conditions while overlooking the real-world conditions; 2) adopt the tracking-by-detection paradigm, neglecting rich temporal contexts; 3) only integrate the temporal information into the template, where temporal contexts among consecutive frames are far from being fully utilized. To handle those problems, we propose a two-level framework (TCTrack) that can exploit temporal contexts efficiently. Based on it, we propose a stronger version for real-world visual tracking, i.e., TCTrack++. It boils down to two levels: features and similarity maps. Specifically, for feature extraction, we propose an attention-based temporally adaptive convolution to enhance the spatial features using temporal information, which is achieved by dynamically…
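The abstract is cut off before it says how the convolution is adapted, so the following is only a minimal sketch of the general idea of a temporally adaptive convolution, not the authors' method. It assumes a 1x1 convolution whose shared kernel is rescaled per frame by a factor computed from attention over the pooled features of the current and earlier frames. The function name temporally_adaptive_conv, the proj matrix, and the 1 + tanh calibration rule are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    shifted = scores - scores.max()
    weights = np.exp(shifted)
    return weights / weights.sum()


def temporally_adaptive_conv(frames, base_weight, proj):
    """Apply a 1x1 convolution whose weights are recalibrated per frame.

    Illustrative assumption only; not taken from the paper.

    frames:      (T, C_in, H, W) features of T consecutive frames
    base_weight: (C_out, C_in) shared 1x1 convolution kernel
    proj:        (C_in, C_in) hypothetical projection for the calibration
    """
    num_frames, c_in = frames.shape[:2]
    # One descriptor per frame via global average pooling: (T, C_in).
    descriptors = frames.mean(axis=(2, 3))
    outputs = []
    for t in range(num_frames):
        # Attend from the current frame to itself and all earlier frames.
        past = descriptors[: t + 1]
        attn = softmax(past @ descriptors[t] / np.sqrt(c_in))
        context = attn @ past  # (C_in,)
        # Per-input-channel factor in (0, 2); equals 1 when proj is zero.
        calibration = 1.0 + np.tanh(context @ proj)
        weight_t = base_weight * calibration[None, :]
        # A 1x1 convolution is a per-pixel matrix multiply over channels.
        outputs.append(np.einsum("oc,chw->ohw", weight_t, frames[t]))
    return np.stack(outputs)


rng = np.random.default_rng(0)
frames = rng.standard_normal((4, 8, 5, 5))
base_weight = rng.standard_normal((6, 8))
proj = 0.1 * rng.standard_normal((8, 8))
adapted = temporally_adaptive_conv(frames, base_weight, proj)
```

With proj set to all zeros the calibration factor is exactly 1 and the function reduces to an ordinary 1x1 convolution, which makes it easy to see what the temporal term contributes in this sketch.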
Taxonomy
Topics: Video Surveillance and Tracking Methods
Methods: Convolution
