Exploiting Lightweight Hierarchical ViT and Dynamic Framework for Efficient Visual Tracking

Ben Kang; Xin Chen; Jie Zhao; Chunjuan Bo; Dong Wang; Huchuan Lu

arXiv:2506.20381·cs.CV·June 26, 2025

TL;DR

This paper introduces HiT and DyHiT, efficient hierarchical transformer-based visual trackers that combine high speed with competitive accuracy on resource-limited devices, using a Bridge Module, dual-image position encoding, and dynamic routing.

Contribution

The paper presents a lightweight hierarchical ViT tracking framework with a Bridge Module and dual-image position encoding, along with a dynamic routing approach that adapts computation to scene complexity.

Findings

01. HiT achieves 61 fps on NVIDIA Jetson AGX with 64.6% AUC on LaSOT.

02. DyHiT adapts to scene complexity, reaching 111 fps with 62.4% AUC.

03. The training-free acceleration boosts existing trackers' speed by 2.68 times.
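
To put the frame rates in the findings above into more tangible terms, the short sketch below converts them into an average per-frame time budget and computes the throughput ratio between the two trackers. The helper name `frame_latency_ms` is hypothetical, and the comparison assumes both figures were measured on the same device, which this summary does not state for DyHiT. The 2.68x figure in the third finding refers to existing trackers and is not derived here.

```python
def frame_latency_ms(fps: float) -> float:
    """Average time budget per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


# Frame rates quoted in the findings above.
hit_fps = 61.0     # HiT on NVIDIA Jetson AGX
dyhit_fps = 111.0  # DyHiT (device assumed to match; not stated in the summary)

hit_ms = frame_latency_ms(hit_fps)
dyhit_ms = frame_latency_ms(dyhit_fps)

# Ratio of the two reported throughputs (illustrative arithmetic only).
speed_ratio = dyhit_fps / hit_fps

print(f"HiT:   {hit_ms:.1f} ms per frame")
print(f"DyHiT: {dyhit_ms:.1f} ms per frame")
print(f"DyHiT / HiT throughput ratio: {speed_ratio:.2f}x")
```

Under these assumptions, HiT has roughly 16.4 ms per frame and DyHiT roughly 9.0 ms, a throughput ratio of about 1.82x, obtained at a reported cost of 2.2 AUC points (64.6% versus 62.4%).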

Abstract

Transformer-based visual trackers have demonstrated significant advancements due to their powerful modeling capabilities. However, their practicality is limited on resource-constrained devices because of their slow processing speeds. To address this challenge, we present HiT, a novel family of efficient tracking models that achieve high performance while maintaining fast operation across various devices. The core innovation of HiT lies in its Bridge Module, which connects lightweight transformers to the tracking framework, enhancing feature representation quality. Additionally, we introduce a dual-image position encoding approach to effectively encode spatial information. HiT achieves an impressive speed of 61 frames per second (fps) on the NVIDIA Jetson AGX platform, alongside a competitive AUC of 64.6% on the LaSOT benchmark, outperforming all previous efficient trackers. Building on…

Code & Models

Repositories

kangben258/hit (PyTorch, official)

Taxonomy

Topics: Video Surveillance and Tracking Methods

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings