A Study on Inference Latency for Vision Transformers on Mobile Devices

Zhuojin Li; Marco Paolieri; Leana Golubchik

arXiv:2510.25166·cs.CV·February 20, 2026

TL;DR

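This paper measures the inference latency of 190 real-world vision transformers (ViTs) on mobile devices, compares them with 102 real-world CNNs, and builds a dataset of measured latencies for 1000 synthetic ViTs that supports predicting the latency of new ViTs with accuracy sufficient for real-world applications.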

Contribution

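It provides a quantitative performance analysis of ViTs on mobile devices and introduces a dataset of measured latencies for 1000 synthetic ViTs, collected across two machine learning frameworks and six mobile platforms, for predicting the latency of new ViT architectures.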

Findings

01

ViTs have higher latency than CNNs on mobile devices

02

Latency of new ViTs can be predicted from the dataset with sufficient accuracy for real-world applications

03

The comparison with CNNs yields insights into the factors that influence ViT latency on mobile devices

Abstract

Given the significant advances in machine learning techniques on mobile devices, particularly in the domain of computer vision, in this work we quantitatively study the performance characteristics of 190 real-world vision transformers (ViTs) on mobile devices. Through a comparison with 102 real-world convolutional neural networks (CNNs), we provide insights into the factors that influence the latency of ViT architectures on mobile devices. Based on these insights, we develop a dataset including measured latencies of 1000 synthetic ViTs with representative building blocks and state-of-the-art architectures from two machine learning frameworks and six mobile platforms. Using this dataset, we show that inference latency of new ViTs can be predicted with sufficient accuracy for real-world applications.
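
As a rough illustration of what predicting latency from a dataset of measured architectures can look like, the sketch below fits a one-feature linear model and scores it by mean absolute percentage error. It is not the authors' predictor: the abstract does not describe the model or its input features, so the choice of GFLOPs as the feature, the helper names (fit_linear, predict, mape), and all numeric values are hypothetical placeholders.

```python
from statistics import mean


def fit_linear(xs, ys):
    """Ordinary least squares for y ~ slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def predict(model, x):
    """Predicted latency for a single feature value."""
    slope, intercept = model
    return slope * x + intercept


def mape(model, xs, ys):
    """Mean absolute percentage error of the model on (xs, ys)."""
    return mean(abs(predict(model, x) - y) / y for x, y in zip(xs, ys))


# Hypothetical placeholder data, NOT measurements from the paper:
# one feature per architecture (GFLOPs) and a latency in milliseconds.
train_gflops = [1.0, 2.0, 4.0, 8.0]
train_latency_ms = [12.0, 22.0, 42.0, 82.0]

model = fit_linear(train_gflops, train_latency_ms)

# Estimate latency for an unseen architecture and check the training fit.
new_vit_latency_ms = predict(model, 3.0)
train_error = mape(model, train_gflops, train_latency_ms)
print(new_vit_latency_ms, train_error)
```

In practice a predictor of this kind would be fitted separately for each framework and platform and evaluated on held-out architectures; a single feature such as GFLOPs is used here only to keep the sketch short.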
