RapidNet: Multi-Level Dilated Convolution Based Mobile Backbone
Mustafa Munir, Md Mostafijur Rahman, Radu Marculescu

TL;DR
RapidNet introduces multi-level dilated convolutions to create a purely CNN-based mobile backbone that surpasses state-of-the-art models in accuracy and speed for various vision tasks on mobile devices.
Contribution
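This work proposes Multi-Level Dilated Convolutions for CNNs. Combining several dilation levels gives a larger theoretical receptive field than standard convolutions and lets short-range and long-range features interact, which the authors report improves accuracy and speed for mobile vision models.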
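To make the idea of mixing dilation levels concrete, here is a minimal 1-D sketch in plain Python. It is an illustration only, not the authors' layer: the function names are hypothetical, and merging the branches by summation is an assumption, since the summary does not say how RapidNet combines its dilation levels.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded, same-length 1-D cross-correlation with a dilated kernel.

    Assumes an odd kernel length so the kernel can be centred on each sample.
    """
    k = len(kernel)
    half = dilation * (k - 1) // 2  # distance from the centre tap to the outer taps
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i - half + j * dilation  # taps are spaced `dilation` samples apart
            if 0 <= idx < len(signal):     # samples outside the signal count as zero
                acc += w * signal[idx]
        out.append(acc)
    return out


def multi_level_dilated_conv1d(signal, kernel, dilations=(1, 2)):
    """Run one branch per dilation level and sum them (assumed merge rule)."""
    branches = [dilated_conv1d(signal, kernel, d) for d in dilations]
    return [sum(vals) for vals in zip(*branches)]


# A unit impulse shows which input positions each output can "see".
impulse = [0, 0, 0, 0, 1, 0, 0, 0, 0]
standard = dilated_conv1d(impulse, [1, 1, 1], dilation=1)
multi_level = multi_level_dilated_conv1d(impulse, [1, 1, 1], dilations=(1, 2))
```

With the impulse input, the dilation-1 branch responds at three positions, while the combined output responds at five: the dilation-2 branch contributes the two outer positions (long-range context) and the dilation-1 branch fills in the immediate neighbours (short-range context).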
Findings
Outperforms SOTA mobile CNN, ViT, ViG, and hybrid models in accuracy and speed.
RapidNet-Ti achieves 76.3% top-1 accuracy on ImageNet-1K with 0.9 ms latency.
Pure CNN architectures can surpass hybrid and ViT models when properly designed.
Abstract
Vision transformers (ViTs) have dominated computer vision in recent years. However, ViTs are computationally expensive and not well suited for mobile devices; this led to the prevalence of convolutional neural network (CNN) and ViT-based hybrid models for mobile vision applications. Recently, Vision GNN (ViG) and CNN hybrid models have also been proposed for mobile vision tasks. However, all of these methods remain slower compared to pure CNN-based models. In this work, we propose Multi-Level Dilated Convolutions to devise a purely CNN-based mobile backbone. Using Multi-Level Dilated Convolutions allows for a larger theoretical receptive field than standard convolutions. Different levels of dilation also allow for interactions between the short-range and long-range features in an image. Experiments show that our proposed model outperforms state-of-the-art (SOTA) mobile CNN, ViT, ViG,…
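The receptive-field claim in the abstract can be checked with simple arithmetic. A kernel of size k with dilation d spans d(k - 1) + 1 input positions while keeping only k weights per channel. The sketch below applies that standard formula; the helper names and the example layer configurations are hypothetical and are not taken from the RapidNet architecture.

```python
def effective_kernel_size(kernel_size, dilation):
    """Input span covered by one dilated kernel: d * (k - 1) + 1."""
    return dilation * (kernel_size - 1) + 1


def stacked_receptive_field(layers):
    """Theoretical receptive field of a chain of convolutions.

    `layers` is a sequence of (kernel_size, dilation, stride) tuples,
    ordered from input to output.
    """
    rf, jump = 1, 1  # jump = spacing, in input pixels, between adjacent outputs
    for kernel_size, dilation, stride in layers:
        rf += (effective_kernel_size(kernel_size, dilation) - 1) * jump
        jump *= stride
    return rf


# Illustrative configurations only (not RapidNet's actual layers).
plain_3x3 = effective_kernel_size(3, 1)                        # 3
dilated_3x3 = effective_kernel_size(3, 2)                      # 5
two_plain = stacked_receptive_field([(3, 1, 1), (3, 1, 1)])    # 5
plain_then_dilated = stacked_receptive_field([(3, 1, 1), (3, 2, 1)])  # 7
```

In this example, swapping the second 3x3 layer for a dilation-2 layer widens the theoretical receptive field from 5 to 7 pixels per axis without adding weights. This is a theoretical bound; the effective receptive field a trained network uses can be smaller.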
Taxonomy
Topics: Energy Efficient Wireless Sensor Networks
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
