CNN and ViT Efficiency Study on Tiny ImageNet and DermaMNIST Datasets

Aidar Amangeldi; Angsar Taigonyrov; Muhammad Huzaifa Jawad; Chinedu Emmanuel Mbonu

arXiv:2505.08259 · cs.CV · February 16, 2026

TL;DR

This paper compares convolutional and transformer-based neural networks on Tiny ImageNet and DermaMNIST, showing that fine-tuned Vision Transformers can achieve comparable or better accuracy with reduced inference time and complexity.

Contribution

It introduces a fine-tuning strategy for Vision Transformers that improves efficiency and performance on medical and general image classification tasks.

Findings

01 Vision Transformers can match or outperform the ResNet-18 baseline after fine-tuning.

02 Fine-tuned Vision Transformers can run inference faster and with fewer parameters than the baseline.

03 These gains make fine-tuned Vision Transformers viable for deployment in resource-constrained environments.
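
The inference-speed finding rests on timing a model's forward pass. The paper's measurement protocol is not given on this page, so the following is only a minimal sketch of one common approach: discard a few warm-up calls, then report the median and 95th-percentile wall-clock time. The names `measure_latency_ms`, `predict`, and `batch` are hypothetical and stand in for any model call and input.

```python
import statistics
import time


def measure_latency_ms(predict, batch, warmup=5, runs=30):
    """Time repeated calls to predict(batch) and summarize them in milliseconds.

    `predict` is any callable (hypothetical stand-in for a model's forward
    pass); `batch` is whatever input it accepts.
    """
    # Warm-up calls are not timed, so one-off setup costs do not skew results.
    for _ in range(warmup):
        predict(batch)

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(batch)
        samples.append((time.perf_counter() - start) * 1000.0)

    ordered = sorted(samples)
    p95_index = int(0.95 * (len(ordered) - 1))
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
        "runs": runs,
    }


if __name__ == "__main__":
    # Toy workload standing in for a model; replace with a real forward pass.
    stats = measure_latency_ms(lambda xs: sum(x * x for x in xs), list(range(10_000)))
    print(stats)
```

When timing on an accelerator, the device usually has to be synchronized before each clock read, otherwise asynchronous kernel launches make the numbers look smaller than they are.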

Abstract

This study evaluates the trade-offs between convolutional and transformer-based architectures on both medical and general-purpose image classification benchmarks. We use ResNet-18 as our baseline and introduce a fine-tuning strategy applied to four Vision Transformer variants (Tiny, Small, Base, Large) on DermaMNIST and Tiny ImageNet. Our goal is to reduce inference latency and model complexity while keeping accuracy degradation acceptable. Through systematic hyperparameter variations, we demonstrate that appropriately fine-tuned Vision Transformers can match or exceed the baseline's performance, achieve faster inference, and operate with fewer parameters, highlighting their viability for deployment in resource-constrained environments.
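
To make the "fewer parameters" claim concrete, here is a rough sketch that counts the parameters of a standard ViT from its hyperparameters. It assumes the widely used configurations (patch size 16, 224x224 RGB input, MLP ratio 4, a 1000-class head) and the usual widths and depths for Tiny, Small, Base, and Large. The paper's exact variants, input resolution, and head sizes may differ, so these numbers are illustrative only. The function `vit_param_count` and the `VIT_CONFIGS` table are hypothetical names, and the ResNet-18 figure is the standard 1000-class torchvision model, not necessarily the baseline as configured in the paper.

```python
def vit_param_count(dim, depth, num_classes=1000, patch=16, image=224,
                    channels=3, mlp_ratio=4):
    """Count learnable parameters of a plain ViT classifier (assumed layout)."""
    n_patches = (image // patch) ** 2
    patch_embed = patch * patch * channels * dim + dim   # linear patch projection
    cls_token = dim
    pos_embed = (n_patches + 1) * dim                    # +1 for the class token

    hidden = mlp_ratio * dim
    attention = (dim * 3 * dim + 3 * dim) + (dim * dim + dim)   # qkv + output proj
    mlp = (dim * hidden + hidden) + (hidden * dim + dim)
    layer_norms = 2 * (2 * dim)                          # two LayerNorms per block
    block = attention + mlp + layer_norms

    final_norm = 2 * dim
    head = dim * num_classes + num_classes
    return patch_embed + cls_token + pos_embed + depth * block + final_norm + head


# Assumed standard (width, depth) per variant; not confirmed by the paper.
VIT_CONFIGS = {"Tiny": (192, 12), "Small": (384, 12),
               "Base": (768, 12), "Large": (1024, 24)}

# Standard torchvision ResNet-18 with a 1000-class head, for reference.
RESNET18_PARAMS = 11_689_512

if __name__ == "__main__":
    for name, (dim, depth) in VIT_CONFIGS.items():
        n = vit_param_count(dim, depth)
        relation = "fewer" if n < RESNET18_PARAMS else "more"
        print(f"ViT-{name}: {n:,} parameters ({relation} than ResNet-18)")
```

Under these assumptions only the Tiny variant falls below the ResNet-18 parameter count, which suggests the parameter saving depends on which variant is deployed.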

Taxonomy

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Vision Transformer · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Softmax