TL;DR
This paper accelerates pretrained Vision Transformer (ViT) models by replacing attention heads that behave like convolutions with depthwise convolution layers, reporting 17-20% faster inference with minimal accuracy loss.
Contribution
The authors propose a depthwise convolution-based layer that serves as a drop-in replacement for convolution-like attention heads in ViTs, along with simple strategies for identifying which heads can be replaced and a fine-tuning procedure that recovers downstream task performance.
Findings
Achieves 17-20% inference speedup on image classification and segmentation tasks.
Minimal performance degradation after replacing attention heads with convolution layers.
Validates approach with extensive experiments and efficiency benchmarks.
Abstract
Pretrained vision foundation models deliver strong performance across tasks with limited fine-tuning. However, their Vision Transformer (ViT) backbones impose high inference costs, limiting deployment on resource-constrained devices. In this work, we accelerate large-scale pretrained ViTs while preserving their feature extraction capabilities by exploiting the intrinsic convolution-like behavior of some attention heads. Specifically, we introduce an efficient depthwise convolution-based layer that serves as a drop-in replacement for these heads. Additionally, we propose simple strategies to identify which heads can be replaced and introduce a fine-tuning procedure that recovers downstream task performance. Across both image classification and segmentation tasks, our method achieves 17-20% inference speedup with minimal performance degradation. We validate the approach through…
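To make the "convolution-like behavior" of an attention head concrete, here is a minimal toy sketch in plain Python. It is not the authors' layer or their head-selection strategy. It assumes an idealized head whose attention weights depend only on the relative grid offset between query and key tokens, with no softmax renormalization at image borders. All function names, the 4x4 token grid, and the 3x3 kernel are hypothetical choices made for illustration. Under those assumptions, applying the attention matrix to the values gives the same result as a zero-padded depthwise convolution over the token grid.

```python
# Toy illustration only: names, grid size, and kernel are assumptions,
# not taken from the paper.

def conv_like_attention(height, width, kernel):
    """Build an (H*W) x (H*W) attention matrix whose weights depend only
    on the relative offset between the query token and the key token."""
    r = len(kernel) // 2
    n = height * width
    attn = [[0.0] * n for _ in range(n)]
    for qy in range(height):
        for qx in range(width):
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    ky, kx = qy + dy, qx + dx
                    if 0 <= ky < height and 0 <= kx < width:
                        attn[qy * width + qx][ky * width + kx] = kernel[dy + r][dx + r]
    return attn


def apply_attention(attn, values):
    """Compute attn @ values, where values is a list of N tokens with C channels."""
    n, c = len(values), len(values[0])
    return [
        [sum(attn[i][j] * values[j][ch] for j in range(n)) for ch in range(c)]
        for i in range(n)
    ]


def depthwise_conv(values, height, width, kernels):
    """Zero-padded depthwise convolution: channel ch is filtered only by kernels[ch]."""
    c = len(values[0])
    out = [[0.0] * c for _ in range(height * width)]
    for ch in range(c):
        kernel = kernels[ch]
        r = len(kernel) // 2
        for y in range(height):
            for x in range(width):
                acc = 0.0
                for dy in range(-r, r + 1):
                    for dx in range(-r, r + 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < height and 0 <= xx < width:
                            acc += kernel[dy + r][dx + r] * values[yy * width + xx][ch]
                out[y * width + x][ch] = acc
    return out


def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two token lists."""
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


# Hypothetical setup: a 4x4 token grid with 2 channels per token.
HEIGHT, WIDTH, CHANNELS = 4, 4, 2
kernel = [[0.0, 0.1, 0.0],
          [0.1, 0.6, 0.1],
          [0.0, 0.1, 0.0]]
values = [[((i * 3 + ch * 5) % 7) / 7.0 for ch in range(CHANNELS)]
          for i in range(HEIGHT * WIDTH)]

attn_out = apply_attention(conv_like_attention(HEIGHT, WIDTH, kernel), values)
# An attention head shares one weight pattern across all of its channels,
# so the matching depthwise convolution uses the same kernel for every channel.
conv_out = depthwise_conv(values, HEIGHT, WIDTH, [kernel] * CHANNELS)
difference = max_abs_diff(attn_out, conv_out)
print("max abs difference:", difference)
```

The attention path builds and multiplies an N x N matrix over all token pairs, while the convolution path touches only a k x k neighborhood per token. This difference in work is the kind of saving a convolutional replacement can exploit. In a deep learning framework, a depthwise convolution is typically written as a grouped convolution with one group per channel, for example `torch.nn.Conv2d` with `groups` equal to the number of input channels.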
