Exploiting Lightweight Hierarchical ViT and Dynamic Framework for Efficient Visual Tracking
Ben Kang, Xin Chen, Jie Zhao, Chunjuan Bo, Dong Wang, Huchuan Lu

TL;DR
This paper introduces HiT and DyHiT, efficient hierarchical transformer-based visual trackers that combine a Bridge Module, dual-image position encoding, and dynamic routing to reach high speed and accuracy on resource-limited devices.
Contribution
The paper presents a lightweight hierarchical ViT tracking framework with a Bridge Module and dual-image position encoding, along with a dynamic routing approach that adapts computation to scene complexity.
Findings
HiT achieves 61 fps on NVIDIA Jetson AGX with 64.6% AUC on LaSOT.
DyHiT adapts to scene complexity, reaching 111 fps with 62.4% AUC.
The training-free acceleration boosts existing trackers' speed by 2.68 times.
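The summary does not say how DyHiT decides which route a frame takes, so the following is only a minimal sketch of the general idea behind dynamic routing: easy frames take a cheap route, hard frames take the full one, and average throughput depends on the mix. The `Route` class, the `difficulty` score, the threshold, and every cost value are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Route:
    """A processing route with a hypothetical per-frame cost in milliseconds."""
    name: str
    cost_ms: float


def route_frame(difficulty: float, threshold: float, early: Route, full: Route) -> Route:
    """Send easy frames (difficulty below threshold) down the cheap route."""
    return early if difficulty < threshold else full


def average_fps(difficulties: Sequence[float], threshold: float,
                early: Route, full: Route) -> float:
    """Average frames per second over a sequence of per-frame difficulty scores."""
    if not difficulties:
        raise ValueError("need at least one frame")
    total_ms = sum(route_frame(d, threshold, early, full).cost_ms for d in difficulties)
    return 1000.0 * len(difficulties) / total_ms


# Hypothetical costs: the early route is three times cheaper than the full one.
EARLY = Route("early", cost_ms=5.0)
FULL = Route("full", cost_ms=15.0)

# Three easy frames and one hard frame (made-up difficulty scores).
fps = average_fps([0.1, 0.2, 0.9, 0.3], threshold=0.5, early=EARLY, full=FULL)
print(f"average throughput: {fps:.1f} fps")
```

In this toy setup, the more frames fall below the threshold, the closer average throughput gets to the cheap route's rate, which is the kind of speed and accuracy trade-off the DyHiT result describes.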
Abstract
Transformer-based visual trackers have demonstrated significant advancements due to their powerful modeling capabilities. However, their practicality is limited on resource-constrained devices because of their slow processing speeds. To address this challenge, we present HiT, a novel family of efficient tracking models that achieve high performance while maintaining fast operation across various devices. The core innovation of HiT lies in its Bridge Module, which connects lightweight transformers to the tracking framework, enhancing feature representation quality. Additionally, we introduce a dual-image position encoding approach to effectively encode spatial information. HiT achieves an impressive speed of 61 frames per second (fps) on the NVIDIA Jetson AGX platform, alongside a competitive AUC of 64.6% on the LaSOT benchmark, outperforming all previous efficient trackers. Building on…
Taxonomy
Topics: Video Surveillance and Tracking Methods
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
