TL;DR
ETCTrack introduces a dynamic token compression framework for visual tracking that reduces computational costs while maintaining high accuracy by filtering redundant features and enabling adaptive interaction.
Contribution
The paper proposes a novel compress-then-interact framework with an Adaptive Token Compressor and Hierarchical Interaction Encoder for efficient, high-performance visual tracking.
Findings
Reduces template tokens by 60%
Reduces MACs by 21.4%
Incurs only a 0.4% accuracy drop on benchmarks
Abstract
Refining visual representations by eliminating their internal feature-level redundancy is crucial for simultaneously optimizing the performance and computational cost of models in visual tracking. To enhance their performance, many contemporary Transformer-based trackers leverage a larger number of historical template frames to capture richer spatio-temporal cues. However, this strategy leads to a massive number of input visual tokens. This creates two critical issues: it imposes a quadratic computational burden and can also degrade the tracker's overall performance. To bridge this gap, we propose a compress-then-interact tracking framework, ETCTrack, that learns to efficiently compress template tokens from historical template frames into a robust target representation, moving beyond handcrafted rules. Our method first employs the Adaptive Token Compressor to dynamically construct…
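The quadratic burden mentioned in the abstract can be illustrated with a rough sketch. It assumes full self-attention over the concatenated template and search tokens, so the number of query-key pairs grows with the square of the total token count. The token counts (250 template, 250 search) and the score-based `keep_top_k` helper are hypothetical and chosen only for illustration. In particular, `keep_top_k` is a simple stand-in for pruning and is not the paper's learned Adaptive Token Compressor.

```python
def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs in full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens


def keep_top_k(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving their order.

    Hypothetical stand-in for a compressor: real scores would be learned.
    """
    k = max(1, round(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])  # restore original token order
    return [tokens[i] for i in kept]


# Assumed sizes, not taken from the paper.
num_template, num_search = 250, 250

# Dropping 60% of template tokens keeps a 0.4 fraction of them.
template = list(range(num_template))
scores = [float(i % 7) for i in template]  # placeholder importance scores
compressed = keep_top_k(template, scores, keep_ratio=0.4)

before = attention_pairs(num_template + num_search)
after = attention_pairs(len(compressed) + num_search)
print(f"pairs before: {before}, after: {after}, ratio: {after / before:.2f}")
```

Under these assumed sizes the attention pair count falls to roughly half. This covers only the attention term, so it is not expected to match the 21.4% MACs reduction reported for the whole model, which also includes layers whose cost scales linearly with token count.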
