Adaptively Bypassing Vision Transformer Blocks for Efficient Visual Tracking
Xiangyang Yang, Dan Zeng, Xucheng Wang, You Wu, Hengzhou Ye, Qijun Zhao, and Shuiwang Li

TL;DR
ABTrack introduces an adaptive framework that selectively bypasses transformer blocks in visual tracking, significantly improving speed while maintaining state-of-the-art accuracy by dynamically simplifying the model based on scene and target characteristics.
Contribution
The paper proposes a novel adaptive bypass mechanism and a ViT pruning method to enhance the efficiency of transformer-based visual trackers without sacrificing performance.
Findings
Achieves state-of-the-art tracking performance on multiple benchmarks.
Significantly reduces inference time compared to existing methods.
Demonstrates the effectiveness of adaptive bypassing and pruning in real-world scenarios.
Abstract
Empowered by transformer-based models, visual tracking has advanced significantly. However, the slow speed of current trackers limits their applicability on devices with constrained computational resources. To address this challenge, we introduce ABTrack, an adaptive computation framework that adaptively bypasses transformer blocks for efficient visual tracking. The rationale behind ABTrack is rooted in the observation that semantic features or relations do not uniformly impact the tracking task across all abstraction levels. Instead, this impact varies based on the characteristics of the target and the scene it occupies. Consequently, disregarding insignificant semantic features or relations at certain abstraction levels may not significantly affect the tracking accuracy. We propose a Bypass Decision Module (BDM) to determine if a transformer block should be bypassed, which adaptively…
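The bypass idea described in the abstract can be illustrated with a small sketch. The abstract excerpt does not say how the BDM computes its decision, so everything here is an assumption made for illustration: the names `bypass_probability` and `run_with_bypass`, the linear-plus-sigmoid decision head, and the 0.5 threshold are hypothetical and are not taken from the paper. Plain Python lists stand in for token features, and ordinary functions stand in for transformer blocks.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Block = Callable[[Vector], Vector]
Gate = Tuple[Sequence[float], float]  # (weights, bias) of a hypothetical decision head


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bypass_probability(features: Sequence[float], gate: Gate) -> float:
    """Assumed decision head: a linear layer followed by a sigmoid."""
    weights, bias = gate
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(logit)


def run_with_bypass(
    features: Vector,
    blocks: Sequence[Block],
    gates: Sequence[Gate],
    threshold: float = 0.5,  # assumed cut-off, not a value from the paper
) -> Tuple[Vector, List[bool]]:
    """Run blocks in order, skipping any block whose gate votes to bypass.

    Returns the final features and one flag per block (True = executed).
    """
    executed: List[bool] = []
    for block, gate in zip(blocks, gates):
        if bypass_probability(features, gate) > threshold:
            executed.append(False)  # bypass: features pass through unchanged
            continue
        features = block(features)
        executed.append(True)
    return features, executed


if __name__ == "__main__":
    # Two toy "blocks"; the second gate is biased strongly towards bypassing.
    blocks = [lambda v: [x + 1.0 for x in v], lambda v: [x * 2.0 for x in v]]
    gates = [([0.0, 0.0], -5.0), ([0.0, 0.0], 5.0)]
    print(run_with_bypass([1.0, 2.0], blocks, gates))
```

In this toy setup, skipping a block saves its entire computation, which is where an approach of this kind would get its speed-up. A real tracker would need a learned decision head and a training scheme for it, neither of which this sketch covers.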
Taxonomy
Topics: CCD and CMOS Imaging Sensors
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Pruning
