MobileViG: Graph-Based Sparse Attention for Mobile Vision Applications
Mustafa Munir, William Avery, Radu Marculescu

TL;DR
MobileViG is a hybrid CNN-GNN architecture built around a graph-based sparse attention mechanism, Sparse Vision Graph Attention (SVGA). It beats existing ViG models and existing mobile CNN and ViT architectures in accuracy and/or speed on image classification, object detection, and instance segmentation.
Contribution
The paper proposes Sparse Vision Graph Attention (SVGA), a graph-based sparse attention mechanism designed for ViGs running on mobile devices, and MobileViG, the first hybrid CNN-GNN architecture for vision tasks on mobile devices, which uses SVGA.
Findings
MobileViG-Ti achieves 75.7% top-1 accuracy on ImageNet-1K.
MobileViG-B attains 82.6% top-1 accuracy with 2.30 ms latency.
MobileViG beats existing ViG models and existing mobile CNN and ViT architectures in accuracy and/or speed on image classification, object detection, and instance segmentation.
Abstract
Traditionally, convolutional neural networks (CNN) and vision transformers (ViT) have dominated computer vision. However, recently proposed vision graph neural networks (ViG) provide a new avenue for exploration. Unfortunately, for mobile applications, ViGs are computationally expensive due to the overhead of representing images as graph structures. In this work, we propose a new graph-based sparse attention mechanism, Sparse Vision Graph Attention (SVGA), that is designed for ViGs running on mobile devices. Additionally, we propose the first hybrid CNN-GNN architecture for vision tasks on mobile devices, MobileViG, which uses SVGA. Extensive experiments show that MobileViG beats existing ViG models and existing mobile CNN and ViT architectures in terms of accuracy and/or speed on image classification, object detection, and instance segmentation tasks. Our fastest model, MobileViG-Ti,…
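The abstract does not spell out how SVGA builds its graph, so the following is only a minimal sketch under stated assumptions: each pixel is statically connected to every K-th pixel along its own row and column (instead of running a nearest-neighbor search), and neighbor information is aggregated as the maximum relative feature difference. The function name `svga_max_relative`, the wrap-around at image borders, the sign convention of the difference, and the single-channel grid are all illustrative choices, not details taken from this page. The learned projection that would normally follow the aggregation is omitted.

```python
def svga_max_relative(x, k=2):
    """Sketch of a static sparse graph aggregation on an H x W grid.

    x is a list of lists of numbers (one channel). Each pixel is assumed
    to be connected to every k-th pixel in its row and column, with
    wrap-around at the borders (an assumption for this sketch).
    Returns, per pixel, max(0, max over neighbors of (neighbor - self)).
    """
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]

    # Vertical connections: neighbors at row offsets k, 2k, ... (< h).
    for shift in range(k, h, k):
        for i in range(h):
            for j in range(w):
                neighbor = x[(i - shift) % h][j]
                out[i][j] = max(out[i][j], neighbor - x[i][j])

    # Horizontal connections: neighbors at column offsets k, 2k, ... (< w).
    for shift in range(k, w, k):
        for i in range(h):
            for j in range(w):
                neighbor = x[i][(j - shift) % w]
                out[i][j] = max(out[i][j], neighbor - x[i][j])

    return out


if __name__ == "__main__":
    grid = [[0, 1, 2, 3],
            [4, 5, 6, 7],
            [8, 9, 10, 11],
            [12, 13, 14, 15]]
    for row in svga_max_relative(grid, k=2):
        print(row)
```

Because the connectivity is fixed by the offsets alone, no pairwise distance computation or index gathering is needed in this sketch, which is the kind of overhead the abstract attributes to representing images as graphs.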
Taxonomy
Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · Visual Attention and Saliency Detection
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
