Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking

Ben Kang; Xin Chen; Dong Wang; Houwen Peng; Huchuan Lu

arXiv:2308.06904 · cs.CV · August 15, 2023 · 6 citations

TL;DR

This paper introduces HiT, a lightweight hierarchical vision transformer for visual tracking that achieves high speed and competitive accuracy on edge devices by using a novel Bridge Module and dual-image position encoding.

Contribution

The paper proposes HiT, a new efficient tracking model with a Bridge Module and dual-image position encoding, enabling high-speed performance on limited hardware.

Findings

01 Runs at 61 fps on the Nvidia Jetson AGX
02 Achieves 64.6% AUC on the LaSOT benchmark
03 Surpasses previous efficient trackers in accuracy

Abstract

Transformer-based visual trackers have demonstrated significant progress owing to their superior modeling capabilities. However, existing trackers are hampered by low speed, limiting their applicability on devices with limited computational power. To alleviate this problem, we propose HiT, a new family of efficient tracking models that can run at high speed on different devices while retaining high performance. The central idea of HiT is the Bridge Module, which bridges the gap between modern lightweight transformers and the tracking framework. The Bridge Module incorporates the high-level information of deep features into the shallow large-resolution features. In this way, it produces better features for the tracking head. We also propose a novel dual-image position encoding technique that simultaneously encodes the position information of both the search region and template images.…
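The abstract describes the Bridge Module only at a high level: deep, high-level features are incorporated into shallow, large-resolution features before the tracking head. The sketch below illustrates one common way to do that kind of fusion (project the deep map to the shallow channel width, upsample it, and add it). It is an assumption for illustration, not the authors' implementation: the function names `upsample_nearest` and `bridge_fuse`, the nearest-neighbour upsampling, the additive fusion, and all tensor sizes are hypothetical choices, and NumPy stands in for a deep-learning framework.

```python
import numpy as np


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def bridge_fuse(shallow, deep, proj):
    """Fuse deep, low-resolution features into shallow, high-resolution ones.

    Hypothetical fusion rule (an assumption, not taken from the paper):
      shallow: (C_s, H, W)      large-resolution features from an early stage
      deep:    (C_d, H/f, W/f)  high-level features from a late stage
      proj:    (C_s, C_d)       weights of a 1x1 projection to C_s channels
    Returns a (C_s, H, W) map.
    """
    factor = shallow.shape[1] // deep.shape[1]
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    deep_proj = np.einsum("oc,chw->ohw", proj, deep)
    # Bring the deep map to the shallow resolution, then add element-wise.
    return shallow + upsample_nearest(deep_proj, factor)


rng = np.random.default_rng(0)
shallow = rng.standard_normal((64, 16, 16))   # illustrative sizes only
deep = rng.standard_normal((256, 4, 4))
proj = rng.standard_normal((64, 256)) * 0.01

fused = bridge_fuse(shallow, deep, proj)
print(fused.shape)  # (64, 16, 16)
```

The fused map keeps the spatial resolution of the shallow features, which is the property the abstract emphasizes for the tracking head. The official repository listed under Code & Models is the place to check how the fusion is actually done.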

Code & Models

Repositories

kangben258/hit (PyTorch, official)

Videos

Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking · YouTube

Taxonomy

Topics: Visual Attention and Saliency Detection · Gaze Tracking and Assistive Technology · CCD and CMOS Imaging Sensors

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings