Adaptively Bypassing Vision Transformer Blocks for Efficient Visual Tracking

Xiangyang Yang, Dan Zeng, Xucheng Wang, You Wu, Hengzhou Ye, Qijun Zhao, and Shuiwang Li

arXiv:2406.08037 · cs.CV · July 2, 2024 · 1 citation

TL;DR

ABTrack introduces an adaptive framework that selectively bypasses transformer blocks in visual tracking, significantly improving speed while maintaining state-of-the-art accuracy by dynamically simplifying the model based on scene and target characteristics.

Contribution

The paper proposes a novel adaptive bypass mechanism and a ViT pruning method to enhance the efficiency of transformer-based visual trackers without sacrificing performance.

Findings

01. Achieves state-of-the-art tracking performance on multiple benchmarks.

02. Significantly reduces inference time compared to existing methods.

03. Demonstrates the effectiveness of adaptive bypassing and pruning in real-world scenarios.

Abstract

Empowered by transformer-based models, visual tracking has advanced significantly. However, the slow speed of current trackers limits their applicability on devices with constrained computational resources. To address this challenge, we introduce ABTrack, an adaptive computation framework that adaptively bypasses transformer blocks for efficient visual tracking. The rationale behind ABTrack is rooted in the observation that semantic features or relations do not uniformly impact the tracking task across all abstraction levels. Instead, this impact varies based on the characteristics of the target and the scene it occupies. Consequently, disregarding insignificant semantic features or relations at certain abstraction levels may not significantly affect the tracking accuracy. We propose a Bypass Decision Module (BDM) to determine if a transformer block should be bypassed, which adaptively…
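The bypass idea in the abstract can be illustrated with a small sketch. The abstract is truncated before it describes how the BDM works, so everything below is an assumption made for illustration rather than the paper's design: the gate is a plain linear score passed through a sigmoid, the blocks are arbitrary callables standing in for transformer blocks, and the names `bypass_probability`, `run_with_bypass`, and `threshold` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Block = Callable[[Vector], Vector]
Gate = Tuple[Sequence[float], float]  # (weights, bias) of a hypothetical linear gate


def bypass_probability(features: Sequence[float], gate: Gate) -> float:
    """Assumed gate form: linear score over the features, squashed to (0, 1)."""
    weights, bias = gate
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-score))


def run_with_bypass(
    features: Vector,
    blocks: Sequence[Block],
    gates: Sequence[Gate],
    threshold: float = 0.5,
) -> Tuple[Vector, List[bool]]:
    """Run the blocks in order, skipping any block whose gate votes to bypass.

    Returns the final features and a per-block flag (True = block was executed).
    """
    executed: List[bool] = []
    for block, gate in zip(blocks, gates):
        if bypass_probability(features, gate) > threshold:
            # Bypassed: features pass through unchanged, the block's cost is saved.
            executed.append(False)
            continue
        features = block(features)
        executed.append(True)
    return features, executed


# Toy stand-ins for transformer blocks (not real attention layers).
def add_one(x: Vector) -> Vector:
    return [v + 1.0 for v in x]


def double(x: Vector) -> Vector:
    return [v * 2.0 for v in x]


# Zero weights make each gate depend only on its bias, so the outcome is fixed:
# the first and third blocks run, the second is bypassed.
toy_gates = [([0.0, 0.0], -5.0), ([0.0, 0.0], 5.0), ([0.0, 0.0], -5.0)]
toy_out, toy_executed = run_with_bypass([1.0, 2.0], [add_one, double, add_one], toy_gates)
```

In this toy setup the decision is made per input at inference time, which is the property the abstract emphasizes: the number of blocks actually executed depends on the features being processed, not on a fixed pruned architecture.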

Taxonomy

Topics: CCD and CMOS Imaging Sensors

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Pruning