Hybrid Convolution and Vision Transformer NAS Search Space for TinyML Image Classification
Mikhael Djajapermana, Moritz Reiber, Daniel Mueller-Gritschneder, Ulf Schlichtmann

TL;DR
This paper proposes a new hybrid CNN-ViT search space for NAS to develop efficient image classification models suitable for tinyML, balancing accuracy and computational constraints.
Contribution
A novel hybrid CNN-ViT search space for NAS that includes local, global, and pooling blocks tailored for tinyML deployment.
Findings
Achieved higher accuracy than ResNet-based tinyML models under tight model size constraints.
Produced architectures with faster inference than those ResNet-based models under the same size constraints.
Demonstrated these results on the CIFAR10 dataset.
Abstract
Hybrids of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have outperformed pure CNN or ViT architectures. However, since these architectures require a large number of parameters and incur high computational costs, they are unsuitable for tinyML deployment. This paper introduces a new hybrid CNN-ViT search space for Neural Architecture Search (NAS) to find efficient hybrid architectures for image classification. The search space covers hybrid CNN and ViT blocks to learn local and global information, as well as a novel Pooling block of searchable pooling layers for efficient feature map reduction. Experimental results on the CIFAR10 dataset show that our proposed search space can produce hybrid CNN-ViT architectures with superior accuracy and inference speed to ResNet-based tinyML models under tight model size constraints.
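To make the idea of a hybrid search space concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical layout in which each stage chains a local (convolution) block, a global (attention) block, and a pooling block. Every option value, the channel choices, the parameter-count formulas, and the function names (`sample_architecture`, `estimate_params`, `output_resolution`, `random_search`) are illustrative assumptions. Random sampling stands in for whatever search strategy the paper actually uses, and a parameter budget stands in for the model size constraint.

```python
import random
from dataclasses import dataclass

# Hypothetical search space: the three block kinds mirror the local / global /
# pooling split described in the abstract; all option values are assumptions.
SEARCH_SPACE = {
    "local": {"kernel_size": [3, 5], "expansion": [1, 2, 4]},
    "global": {"num_heads": [1, 2, 4], "mlp_ratio": [1, 2]},
    "pooling": {"pool_type": ["max", "avg"], "stride": [1, 2]},
}
CHANNEL_CHOICES = [16, 32, 64]  # assumed widths, divisible by every head count


@dataclass
class Block:
    kind: str       # "local", "global", or "pooling"
    channels: int   # output channels of the stage this block belongs to
    options: dict   # sampled hyperparameters for this block


def sample_architecture(rng, num_stages=3):
    """Draw one candidate: each stage is local -> global -> pooling."""
    blocks = []
    for _ in range(num_stages):
        channels = rng.choice(CHANNEL_CHOICES)
        for kind in ("local", "global", "pooling"):
            options = {name: rng.choice(values)
                       for name, values in SEARCH_SPACE[kind].items()}
            blocks.append(Block(kind, channels, options))
    return blocks


def estimate_params(blocks, in_channels=3):
    """Rough weight count, ignoring biases and normalization layers."""
    total = 0
    c_in = in_channels
    for block in blocks:
        c = block.channels
        if block.kind == "local":
            # Assumed inverted-bottleneck shape:
            # 1x1 expand, k x k depthwise, 1x1 project.
            k = block.options["kernel_size"]
            hidden = c_in * block.options["expansion"]
            total += c_in * hidden + k * k * hidden + hidden * c
            c_in = c
        elif block.kind == "global":
            # Q, K, V and output projections, then a two-layer MLP.
            # The head count does not change the number of weights.
            total += 4 * c * c + 2 * block.options["mlp_ratio"] * c * c
        # Pooling blocks have no weights.
    return total


def output_resolution(blocks, input_size=32):
    """Spatial size after all pooling blocks (32 matches CIFAR10 images)."""
    size = input_size
    for block in blocks:
        if block.kind == "pooling":
            size //= block.options["stride"]
    return size


def random_search(max_params, trials=200, seed=0):
    """Keep only sampled candidates that fit the parameter budget."""
    rng = random.Random(seed)
    feasible = []
    for _ in range(trials):
        arch = sample_architecture(rng)
        if estimate_params(arch) <= max_params:
            feasible.append(arch)
    return feasible
```

A call such as `random_search(max_params=100_000)` returns the sampled candidates that fit the assumed budget. In a real NAS loop each surviving candidate would then be trained and scored on accuracy and inference latency, which this sketch does not attempt.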
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Big Data and Digital Economy
