Enabling Efficient Hardware Acceleration of Hybrid Vision Transformer (ViT) Networks at the Edge

Joren Dumoulin; Pouya Houshmand; Vikram Jain; Marian Verhelst

arXiv:2507.14651·cs.AR·July 22, 2025

TL;DR

This paper presents a hardware accelerator design for hybrid vision transformer networks, optimizing execution on resource-limited edge devices through a configurable processing element (PE) array, temporal loop re-ordering, and layer fusion.

Contribution

It introduces a configurable PE array and novel scheduling strategies to efficiently support diverse hybrid ViT layers on edge hardware.

Findings

01

Achieved 1.39 TOPS/W energy efficiency in a 28 nm CMOS implementation.

02

Supported all hybrid ViT layer types with a configurable PE array.

03

Reduced off-chip memory transfers through layer fusion and optimized scheduling.
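
To see why fusion can cut off-chip traffic, here is a rough back-of-the-envelope sketch. It is not taken from the paper: it assumes a stride-1 inverted bottleneck (1x1 expand, 3x3 depthwise, 1x1 project), ignores weight traffic, and assumes that without fusion every intermediate tensor is written to and read back from off-chip memory. The function names and the example shape are hypothetical.

```python
def unfused_activation_traffic(h, w, c_in, expand, c_out, bytes_per_elem=1):
    """Off-chip activation bytes when each layer spills its output (assumption)."""
    c_mid = c_in * expand          # channels after the 1x1 expansion
    inp = h * w * c_in             # block input
    mid = h * w * c_mid            # expanded intermediate tensor
    out = h * w * c_out            # block output
    # expand:    read inp, write mid
    # depthwise: read mid, write mid
    # project:   read mid, write out
    return (inp + 4 * mid + out) * bytes_per_elem


def fused_activation_traffic(h, w, c_in, expand, c_out, bytes_per_elem=1):
    """Off-chip activation bytes when intermediates stay on-chip (assumption)."""
    inp = h * w * c_in
    out = h * w * c_out
    # Only the block input and block output cross the chip boundary.
    return (inp + out) * bytes_per_elem


if __name__ == "__main__":
    shape = dict(h=32, w=32, c_in=64, expand=4, c_out=64)  # hypothetical layer
    unfused = unfused_activation_traffic(**shape)
    fused = fused_activation_traffic(**shape)
    print(unfused, fused, unfused / fused)
```

Under these assumptions the expanded intermediate tensor dominates the traffic, which is why keeping it on-chip matters most for blocks with a large expansion factor.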

Abstract

Hybrid vision transformers combine elements of conventional neural networks (NN) and vision transformers (ViT) to enable lightweight and accurate detection. However, several challenges remain for their efficient deployment on resource-constrained edge devices. The hybrid models suffer from a widely diverse set of NN layer types and large intermediate data tensors, hampering efficient hardware acceleration. To enable their execution at the edge, this paper proposes innovations across the hardware-scheduling stack: (a) at the lowest level, a configurable PE array that supports all hybrid ViT layer types; (b) temporal loop re-ordering within one layer, which enables hardware support for normalization and softmax layers while minimizing on-chip data transfers; (c) layer fusion across inverted bottleneck layers, a further scheduling optimization that drastically reduces off-chip memory transfers. The…
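
As an illustration of how re-ordering the loops of a reduction-heavy layer can limit data movement, the sketch below computes softmax over a row in tiles while keeping only a running maximum and a running sum. This is the widely used "online softmax" formulation, shown here as a stand-in; the excerpt does not describe the paper's actual loop ordering, and the function name and tile size are hypothetical.

```python
import math


def streaming_softmax(row, tile=4):
    """Softmax computed tile by tile with a running max and running sum."""
    running_max = float("-inf")
    running_sum = 0.0
    # First pass: visit the row one tile at a time, updating two scalars.
    for start in range(0, len(row), tile):
        chunk = row[start:start + tile]
        new_max = max(running_max, max(chunk))
        # Rescale the sum accumulated so far to the new maximum.
        running_sum *= math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in chunk)
        running_max = new_max
    # Second pass: normalize each element with the final statistics.
    return [math.exp(x - running_max) / running_sum for x in row]


if __name__ == "__main__":
    print(streaming_softmax([1.0, 2.0, 3.0, 4.0, 5.0], tile=2))
```

The design point is that the statistics needed for normalization are accumulated alongside the tiled traversal, so the full row never has to be buffered just to find its maximum.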
