Uncertainty-Guided Inference-Time Depth Adaptation for Transformer-Based Visual Tracking

Patrick Poggi; Divake Kumar; Theja Tulabandhula; Amit Ranjan Trivedi

arXiv:2602.16160 · cs.CV · February 23, 2026

Open Access

TL;DR

This paper introduces UncL-STARK, a method for transformer-based visual tracking that dynamically adjusts inference depth based on uncertainty, reducing computational cost (up to 12% fewer GFLOPs, 8.9% lower latency, and 10.8% energy savings) while maintaining high tracking accuracy.
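The TL;DR does not spell out the depth-selection rule, so the following is only a minimal sketch of how an uncertainty-driven depth controller could work, assuming one scalar uncertainty in [0, 1] per frame. The class `DepthController`, its thresholds `low` and `high`, and the depth bounds are hypothetical illustrations, not the policy used in UncL-STARK.

```python
from dataclasses import dataclass, field


@dataclass
class DepthController:
    """Hypothetical feedback controller mapping per-frame uncertainty to a depth.

    Thresholds, step size, and depth bounds are illustrative assumptions,
    not values taken from UncL-STARK.
    """

    min_depth: int = 2   # shallowest depth the controller may select (assumed)
    max_depth: int = 6   # full stack depth (assumed)
    low: float = 0.3     # uncertainty below this counts as an "easy" frame
    high: float = 0.6    # uncertainty above this counts as a "hard" frame
    depth: int = field(init=False)

    def __post_init__(self) -> None:
        # Begin at full depth until there is evidence the sequence is easy.
        self.depth = self.max_depth

    def update(self, uncertainty: float) -> int:
        """Consume the latest frame's uncertainty and return the next frame's depth."""
        if uncertainty > self.high:
            self.depth = self.max_depth  # hard frame: recover full depth at once
        elif uncertainty < self.low:
            self.depth = max(self.min_depth, self.depth - 1)  # easy frame: shed one layer
        # Between low and high, keep the current depth (hysteresis band).
        return self.depth
```

For example, feeding the uncertainties `[0.1, 0.1, 0.2, 0.9, 0.5]` yields the depths `[5, 4, 3, 6, 6]`: the controller sheds one layer per confident frame and returns to full depth as soon as a frame looks hard. The asymmetric rule (slow to shrink, fast to recover) is one plausible way to keep truncation conservative; the paper's feedback policy may differ.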

Contribution

The paper presents a novel uncertainty-aware depth adaptation technique for transformer trackers that neither alters the original network architecture nor adds auxiliary heads.
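One way to picture an architecture-preserving approach is inference-time truncation: the same layer stack and the same output head are kept, and only the number of layers executed per frame changes. The sketch below assumes generic callables for the layers and the head; `truncated_forward` is a hypothetical helper, and in a real tracker the layers would be transformer encoder or decoder blocks operating on tensors.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")  # hidden-state type (a tensor in a real tracker; generic here)
R = TypeVar("R")  # head output type (e.g. box coordinates)


def truncated_forward(
    x: T,
    layers: Sequence[Callable[[T], T]],
    head: Callable[[T], R],
    depth: int,
) -> R:
    """Hypothetical helper: run the first `depth` layers of an unchanged stack,
    then feed the intermediate state to the same shared output head."""
    if not 1 <= depth <= len(layers):
        raise ValueError(f"depth must be in [1, {len(layers)}], got {depth}")
    for layer in layers[:depth]:  # later layers are simply skipped
        x = layer(x)
    return head(x)
```

With six toy layers that each add 1.0 to every element and `sum` as the head, `truncated_forward([0.0, 0.0], layers, sum, 3)` returns `6.0`, while the full depth of 6 returns `12.0`. Because no exit-specific heads are introduced, the intermediate representation must already be usable by the final head, which is what fine-tuning for robustness at multiple intermediate depths is meant to provide.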

Findings

01. Up to 12% reduction in GFLOPs
02. 8.9% decrease in latency
03. 10.8% energy savings
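The source does not state how these percentages are computed. Assuming each figure is a relative reduction against the full-depth baseline, and that the adaptive tracker's average per-frame cost is a mixture of per-depth costs weighted by how often each depth is selected, the arithmetic can be sketched as follows. Both functions and the toy numbers in the usage note are hypothetical and are not measurements from the paper.

```python
def expected_cost(depth_fraction: dict[int, float], cost_at_depth: dict[int, float]) -> float:
    """Average per-frame cost when a fraction depth_fraction[d] of frames runs at depth d.

    Assumes costs combine linearly across frames (natural for GFLOPs,
    a simplification for latency and energy)."""
    total = sum(depth_fraction.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"depth fractions must sum to 1, got {total}")
    return sum(frac * cost_at_depth[d] for d, frac in depth_fraction.items())


def relative_reduction(full_cost: float, adaptive_cost: float) -> float:
    """Percentage saved by the adaptive tracker relative to the full-depth baseline."""
    if full_cost <= 0:
        raise ValueError("full_cost must be positive")
    return 100.0 * (full_cost - adaptive_cost) / full_cost
```

With toy values, running half the frames at a depth costing 10.0 units and half at a depth costing 6.0 units gives an average of 8.0 units, so `relative_reduction(10.0, 8.0)` is `20.0` percent.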

Abstract

Transformer-based single-object trackers achieve state-of-the-art accuracy but rely on fixed-depth inference, executing the full encoder-decoder stack for every frame regardless of visual complexity, thereby incurring unnecessary computational cost in long video sequences dominated by temporally coherent frames. We propose UncL-STARK, an architecture-preserving approach that enables dynamic, uncertainty-aware depth adaptation in transformer-based trackers without modifying the underlying network or adding auxiliary heads. The model is fine-tuned to retain predictive robustness at multiple intermediate depths using random-depth training with knowledge distillation, thus enabling safe inference-time truncation. At runtime, we derive a lightweight uncertainty estimate directly from the model's corner localization heatmaps and use it in a feedback-driven policy that selects the encoder and…
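The abstract says only that the uncertainty estimate is derived directly from the model's corner localization heatmaps; the exact formula is not given here. As an assumed stand-in, the sketch below scores each corner heatmap (top-left and bottom-right) by the normalized entropy of its softmax distribution and averages the two, giving a value near 0 for a single sharp peak and 1 for a flat map. The function names and the entropy choice are hypothetical, and the heatmaps are plain nested lists rather than tensors.

```python
import math
from typing import Sequence

Heatmap = Sequence[Sequence[float]]  # raw corner scores (logits), rows x cols


def softmax_flat(scores: Heatmap) -> list[float]:
    """Flatten a 2-D score map and turn it into a probability distribution."""
    flat = [s for row in scores for s in row]
    m = max(flat)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in flat]
    z = sum(exps)
    return [e / z for e in exps]


def normalized_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy divided by its maximum, log(N), so it lies in [0, 1]."""
    n = len(probs)
    if n < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(n)


def corner_uncertainty(top_left: Heatmap, bottom_right: Heatmap) -> float:
    """Hypothetical per-frame uncertainty: mean normalized entropy of both corner maps."""
    return 0.5 * (
        normalized_entropy(softmax_flat(top_left))
        + normalized_entropy(softmax_flat(bottom_right))
    )
```

A flat 4×4 map scores about 1.0, while a map with one dominant logit (20.0 against zeros) scores close to 0, so the value can be compared directly against thresholds in a depth-selection policy. Because it reuses heatmaps the tracker already computes, the estimate adds only a softmax and a sum per frame.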

Taxonomy

Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Face recognition and analysis