Accelerating Vision Foundation Models with Drop-in Depthwise Convolution

Carmelo Scribano; Mohammad Mahdi; Nedyalko Prisadnikov; Yuqian Fu; Giorgia Franchini; Danda Pani Paudel; Marko Bertogna; Luc Van Gool

arXiv:2605.22132·cs.CV·May 22, 2026

TL;DR

This paper accelerates pretrained Vision Transformer (ViT) models by replacing attention heads that behave like convolutions with depthwise convolution layers, achieving a 17-20% inference speedup with minimal accuracy loss.

Contribution

The authors propose a drop-in depthwise convolution layer that replaces specific attention heads in ViTs, simple strategies for identifying which heads can be replaced, and a fine-tuning procedure that recovers downstream task performance.
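One way to picture what "identifying replaceable heads" could involve is to score how local a head's attention pattern is. The sketch below is illustrative only: the function names, the window radius, and the threshold are hypothetical, and the selection strategies actually used in the paper are not described on this page and may differ. It assumes a head's attention map over an H x W patch grid, with any class token already removed.

```python
def locality_score(attn, grid_h, grid_w, radius=1):
    """Average fraction of attention mass each query places on patches
    within `radius` (Chebyshev distance) of its own grid position.

    attn: N x N nested list, each row non-negative and summing to 1,
          where N == grid_h * grid_w (patch tokens only; an assumption).
    """
    n = grid_h * grid_w
    assert len(attn) == n and all(len(row) == n for row in attn)
    total = 0.0
    for q in range(n):
        qy, qx = divmod(q, grid_w)  # query position on the patch grid
        local = 0.0
        for k in range(n):
            ky, kx = divmod(k, grid_w)  # key position on the patch grid
            if abs(ky - qy) <= radius and abs(kx - qx) <= radius:
                local += attn[q][k]
        total += local
    return total / n


def select_replaceable_heads(head_maps, grid_h, grid_w, threshold=0.9):
    """Return indices of heads whose locality score meets a (hypothetical)
    threshold; such heads are candidates for a convolutional replacement."""
    return [
        i for i, attn in enumerate(head_maps)
        if locality_score(attn, grid_h, grid_w) >= threshold
    ]


# Two toy heads on a 3 x 3 grid: one attends only to itself (fully local),
# the other attends uniformly to every patch (global).
n = 9
identity_head = [[1.0 if q == k else 0.0 for k in range(n)] for q in range(n)]
uniform_head = [[1.0 / n for _ in range(n)] for _ in range(n)]
candidates = select_replaceable_heads([identity_head, uniform_head], 3, 3)
```

In this toy setting only the first head is flagged. A real criterion would average scores over many images and layers before deciding which heads to replace.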

Findings

01. Achieves a 17-20% inference speedup on image classification and segmentation tasks.

02. Incurs minimal performance degradation after attention heads are replaced with convolution layers.

03. Validates the approach with extensive experiments and efficiency benchmarks.

Abstract

Pretrained vision foundation models deliver strong performance across tasks with limited fine-tuning. However, their Vision Transformer (ViT) backbones impose high inference costs, limiting deployment on resource-constrained devices. In this work, we accelerate large-scale pretrained ViTs while preserving their feature extraction capabilities by exploiting the intrinsic convolution-like behavior of some attention heads. Specifically, we introduce an efficient depthwise convolution-based layer that serves as a drop-in replacement for these heads. Additionally, we propose simple strategies to identify which heads can be replaced and introduce a fine-tuning procedure that recovers downstream task performance. Across both image classification and segmentation tasks, our method achieves a 17-20% inference speedup with minimal performance degradation. We validate the approach through…
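To make the operation concrete, here is a minimal pure-Python sketch of a depthwise convolution over a patch-token feature map. It is not the authors' layer: the function name, the zero padding, and the 3 x 3 kernel size are assumptions chosen for illustration. The point it shows is that each channel is filtered by its own small kernel, so the cost grows as N * k^2 per channel for N tokens, rather than as N^2 for the token-to-token interactions of self-attention.

```python
def depthwise_conv2d(x, kernels):
    """Depthwise 2D convolution (cross-correlation form, as in deep
    learning frameworks) with zero padding and stride 1.

    x:       [C][H][W] feature map, e.g. patch tokens reshaped to a grid.
    kernels: [C][k][k], one odd-sized filter per channel (no channel mixing).
    Returns a [C][H][W] nested list.
    """
    c, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(kernels[0])
    pad = k // 2
    out = [[[0.0] * w for _ in range(h)] for _ in range(c)]
    for ch in range(c):
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        y, xx = i + di - pad, j + dj - pad
                        # Positions outside the grid contribute zero.
                        if 0 <= y < h and 0 <= xx < w:
                            acc += kernels[ch][di][dj] * x[ch][y][xx]
                out[ch][i][j] = acc
    return out


# Two channels on a 2 x 2 grid, each with its own kernel:
# channel 0 passes values through, channel 1 doubles them.
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
doubling = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
features = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
result = depthwise_conv2d(features, [identity, doubling])
```

An attention head whose weights depend mostly on the relative position of nearby patches acts like a fixed local filter of this kind, which is the intuition behind swapping such heads for a convolution and then fine-tuning.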
