Uncertainty-Guided Inference-Time Depth Adaptation for Transformer-Based Visual Tracking
Patrick Poggi, Divake Kumar, Theja Tulabandhula, Amit Ranjan Trivedi

TL;DR
This paper introduces UncL-STARK, a method for transformer-based visual tracking that adjusts inference depth at runtime based on uncertainty, reducing computational cost (up to 12% fewer GFLOPs) while maintaining high accuracy.
Contribution
It presents a novel uncertainty-aware depth adaptation technique for transformer trackers that does not alter the original network architecture.
Findings
Up to 12% reduction in GFLOPs
8.9% decrease in latency
10.8% energy savings
Abstract
Transformer-based single-object trackers achieve state-of-the-art accuracy but rely on fixed-depth inference, executing the full encoder-decoder stack for every frame regardless of visual complexity, thereby incurring unnecessary computational cost in long video sequences dominated by temporally coherent frames. We propose UncL-STARK, an architecture-preserving approach that enables dynamic, uncertainty-aware depth adaptation in transformer-based trackers without modifying the underlying network or adding auxiliary heads. The model is fine-tuned to retain predictive robustness at multiple intermediate depths using random-depth training with knowledge distillation, thus enabling safe inference-time truncation. At runtime, we derive a lightweight uncertainty estimate directly from the model's corner localization heatmaps and use it in a feedback-driven policy that selects the encoder and…
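To make the runtime idea in the abstract concrete, here is a minimal sketch of one way a heatmap-derived uncertainty could drive a depth choice. It is not the paper's implementation: the abstract is truncated and does not state the uncertainty measure or the policy. The sketch assumes normalized Shannon entropy of a corner heatmap as the uncertainty proxy and a simple two-threshold rule; the function names, the thresholds `low` and `high`, and the depth bounds `min_depth` and `max_depth` are all hypothetical.

```python
import math


def heatmap_entropy(heatmap):
    """Normalized Shannon entropy of a 2D score map, in [0, 1].

    Assumption: entropy stands in for the paper's uncertainty estimate,
    which the truncated abstract does not specify.
    """
    flat = [v for row in heatmap for v in row]
    if len(flat) < 2:
        return 0.0
    # Softmax over all cells, shifted by the max for numerical stability.
    peak = max(flat)
    exps = [math.exp(v - peak) for v in flat]
    total = sum(exps)
    probs = [e / total for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Divide by the maximum possible entropy (uniform map).
    return entropy / math.log(len(flat))


def select_depth(current_depth, uncertainty, low=0.3, high=0.6,
                 min_depth=2, max_depth=6):
    """Hypothetical feedback rule: go deeper when uncertain, shallower when confident.

    All thresholds and depth bounds are illustrative values, not from the paper.
    """
    if uncertainty > high:
        return min(current_depth + 1, max_depth)
    if uncertainty < low:
        return max(current_depth - 1, min_depth)
    return current_depth


if __name__ == "__main__":
    # A sharply peaked map (confident corner) vs. a flat map (ambiguous corner).
    confident = [[0.0, 0.0], [0.0, 50.0]]
    ambiguous = [[1.0, 1.0], [1.0, 1.0]]
    depth = 4
    for name, hm in (("confident", confident), ("ambiguous", ambiguous)):
        u = heatmap_entropy(hm)
        print(name, round(u, 3), "-> depth", select_depth(depth, u))
```

In this sketch the uncertainty measured on one frame would set the depth used for the next, which is one reading of "feedback-driven"; the stepwise change of one layer at a time is a design choice made here for simplicity.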
Taxonomy
Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Face recognition and analysis
