ViTMAlis: Towards Latency-Critical Mobile Video Analytics with Vision Transformers
Miao Zhang, Guanzhen Wu, Hao Fang, Yifei Zhu, Fangxin Wang, Ruixiao Zhang, and Jiangchuan Liu

TL;DR
ViTMAlis is a framework that optimizes latency-critical mobile video analytics using vision transformers by dynamically balancing resolution, transmission, and inference to reduce delays and improve accuracy.
Contribution
It introduces a novel dynamic mixed-resolution inference strategy and a ViT-native offloading framework tailored for latency-critical dense prediction tasks.
Findings
Significantly reduces end-to-end offloading latency.
Improves user-perceived rendering accuracy.
Outperforms state-of-the-art latency-adaptive baselines.
Abstract
Edge-assisted mobile video analytics (MVA) applications are increasingly shifting from using vision models based on convolutional neural networks (CNNs) to those built on vision transformers (ViTs) to leverage their superior global context modeling and generalization capabilities. However, deploying these advanced models in latency-critical MVA scenarios presents significant challenges. Unlike traditional CNN-based offloading paradigms where network transmission is the primary bottleneck, ViT-based systems are constrained by substantial inference delays, particularly for dense prediction tasks where the need for high-resolution inputs exacerbates the inherent quadratic computational complexity of ViTs. To address these challenges, we propose a dynamic mixed-resolution inference strategy tailored for ViT-backboned dense prediction models, enabling flexible runtime trade-offs between…
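To make the cost of high-resolution inputs concrete, here is a minimal back-of-the-envelope sketch. It is not the authors' implementation. It assumes a patch size of 16, an embedding dimension of 768, and global self-attention. It also assumes a hypothetical mixed-resolution scheme in which a fraction of the patches stay at full resolution and the rest are downsampled by 2x along each axis. All function names and parameters are illustrative.

```python
def num_tokens(height, width, patch=16):
    """Number of non-overlapping patch tokens for an image of the given size."""
    return (height // patch) * (width // patch)


def attention_cost(tokens, dim=768):
    """Rough multiply-accumulate count for one global self-attention layer.

    Computing QK^T and the attention-weighted sum of V each cost about
    tokens^2 * dim, so the total grows quadratically with the token count.
    """
    return 2 * tokens * tokens * dim


def mixed_resolution_tokens(height, width, keep_fraction, downscale=2, patch=16):
    """Token count under an assumed (hypothetical) mixed-resolution scheme.

    `keep_fraction` of the full-resolution patches are kept as-is; the
    remaining area is downsampled by `downscale` along each axis, which
    divides its token count by downscale^2.
    """
    full = num_tokens(height, width, patch)
    kept = round(full * keep_fraction)
    reduced = (full - kept) // (downscale * downscale)
    return kept + reduced


low = num_tokens(512, 512)
high = num_tokens(1024, 1024)
mixed = mixed_resolution_tokens(1024, 1024, keep_fraction=0.25)

print("tokens at 512x512:  ", low)
print("tokens at 1024x1024:", high)
print("attention cost ratio (1024 vs 512):", attention_cost(high) / attention_cost(low))
print("tokens with 25% kept at full resolution:", mixed)
print("attention cost ratio (mixed vs full):", attention_cost(mixed) / attention_cost(high))
```

Under these assumptions, doubling the side length from 512 to 1024 quadruples the token count and raises the per-layer attention cost 16-fold. Keeping only a quarter of the patches at full resolution cuts the token count from 4096 to 1792.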
Taxonomy
Topics: IoT and Edge/Fog Computing · Image and Video Quality Assessment · Advanced Neural Network Applications
