UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking
Hao Wu, Xudong Wang, Jialiang Zhang, Junlong Tong, Xinghao Chen, Junyan Lin, Yunpu Ma, Xiaoyu Shen

TL;DR
UTPTrack introduces a unified token pruning framework for Transformer-based visual trackers. It jointly compresses the search region, dynamic template, and static template, significantly improving efficiency while maintaining high accuracy across multiple benchmarks.
Contribution
UTPTrack is the first framework to jointly prune all three token components of Transformer-based trackers (search region, dynamic template, and static template), enabling more efficient and accurate unified visual tracking.
Findings
Prunes 65.4% of tokens in RGB tracking with minimal performance loss.
Achieves state-of-the-art accuracy-efficiency trade-off on 10 benchmarks.
Supports multimodal and language-guided tracking within a single unified model.
Abstract
One-stream Transformer-based trackers achieve advanced performance in visual object tracking but suffer from significant computational overhead that hinders real-time deployment. While token pruning offers a path to efficiency, existing methods are fragmented. They typically prune the search region, dynamic template, and static template in isolation, overlooking critical inter-component dependencies, which yields suboptimal pruning and degraded accuracy. To address this, we introduce UTPTrack, a simple and Unified Token Pruning framework that, for the first time, jointly compresses all three components. UTPTrack employs an attention-guided, token type-aware strategy to holistically model redundancy, a design that seamlessly supports unified tracking across multimodal and language-guided tasks within a single model. Extensive evaluations on 10 benchmarks demonstrate that UTPTrack…
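To make the idea of attention-guided, token type-aware pruning concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `prune_tokens`, the per-type keep ratios, and the toy attention scores are all hypothetical, and ranking tokens within each type by a single scalar score is an assumption made for illustration only.

```python
import math
from typing import Dict, List, Sequence


def prune_tokens(
    scores: Sequence[float],
    types: Sequence[str],
    keep_ratio: Dict[str, float],
) -> List[int]:
    """Return the sorted indices of the tokens to keep.

    scores     -- one attention-derived importance score per token (assumed given)
    types      -- token type per token, e.g. "static", "dynamic", "search"
    keep_ratio -- hypothetical fraction of tokens to keep for each type
    """
    # Group token indices by their type so each type is ranked separately.
    by_type: Dict[str, List[int]] = {}
    for idx, token_type in enumerate(types):
        by_type.setdefault(token_type, []).append(idx)

    kept: List[int] = []
    for token_type, indices in by_type.items():
        # Keep at least one token per type so no component disappears entirely.
        k = max(1, math.ceil(keep_ratio[token_type] * len(indices)))
        ranked = sorted(indices, key=lambda i: scores[i], reverse=True)
        kept.extend(ranked[:k])

    # Restore the original token order for the next Transformer layer.
    return sorted(kept)


# Toy input: 4 static-template, 4 dynamic-template, and 8 search-region tokens.
types = ["static"] * 4 + ["dynamic"] * 4 + ["search"] * 8
scores = [
    0.9, 0.1, 0.4, 0.3,                          # static template
    0.2, 0.8, 0.7, 0.1,                          # dynamic template
    0.05, 0.6, 0.1, 0.9, 0.2, 0.3, 0.15, 0.1,    # search region
]
ratios = {"static": 0.5, "dynamic": 0.5, "search": 0.25}

kept = prune_tokens(scores, types, ratios)
print(kept)  # [0, 2, 5, 6, 9, 11]
```

In this toy case 6 of 16 tokens survive. The ratios are arbitrary and are not meant to reproduce the pruning rates reported for UTPTrack.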
Taxonomy
Topics: Video Surveillance and Tracking Methods · Gaze Tracking and Assistive Technology · Human Pose and Action Recognition
