Vision Transformers and Convolutional Neural Networks for Land Use Scene Classification
Arun D. Kulkarni

TL;DR
This paper compares Vision Transformers and CNNs for land use scene classification in remote sensing, highlighting their respective strengths, limitations, and suitability depending on data size and scene complexity.
Contribution
It provides a comprehensive performance comparison of ViTs and CNNs on benchmark datasets, offering insights into their advantages and limitations for remote sensing applications.
Findings
CNNs perform well with limited data and local textures
ViTs excel in capturing global relationships with sufficient data
ViTs require more computational resources and larger datasets
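The local-versus-global distinction in these findings can be illustrated with a small NumPy sketch. It is not the paper's method or models. It assumes a single-channel 8x8 image, a naive 3x3 averaging convolution, and single-head scaled dot-product self-attention with identity projections, where each image row is treated as one token (a hypothetical stand-in for patch embedding). The helper names `conv2d_same` and `self_attention` are illustrative only.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Naive single-channel 2D cross-correlation with zero padding."""
    k = kernel.shape[0] // 2
    padded = np.pad(image, k)  # zero padding by default
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            window = padded[i:i + kernel.shape[0], j:j + kernel.shape[1]]
            out[i, j] = np.sum(window * kernel)
    return out

def self_attention(tokens):
    """Scaled dot-product self-attention, identity Q/K/V projections (assumption)."""
    d = tokens.shape[1]
    scores = tokens @ tokens.T / np.sqrt(d)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ tokens

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
kernel = np.ones((3, 3)) / 9.0

# Perturb one pixel in the corner farthest from position (0, 0).
perturbed = image.copy()
perturbed[7, 7] += 10.0

# Convolution: the output at (0, 0) only sees its 3x3 neighbourhood.
conv_delta = conv2d_same(perturbed, kernel)[0, 0] - conv2d_same(image, kernel)[0, 0]

# Self-attention: token 0 attends to every token, including the perturbed one.
attn_delta = np.linalg.norm(self_attention(perturbed)[0] - self_attention(image)[0])

print("change in conv output at (0, 0):", conv_delta)
print("change in attention output for token 0:", attn_delta)
```

In this sketch a single convolution layer leaves the output at the opposite corner unchanged, while one attention layer already propagates the perturbation to every token. A deep CNN widens its receptive field by stacking layers, so the contrast concerns a single layer, not a whole network.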
Abstract
Land Use Scene Classification (LUSC) from remote sensing imagery plays a critical role in environmental monitoring, urban planning, and sustainable resource management. In recent years, deep learning methods have significantly advanced the state of the art, with Convolutional Neural Networks (CNNs) dominating the field because of their strong ability to capture local spatial features. However, the emergence of Vision Transformers (ViTs) has introduced a new paradigm that models long-range dependencies through self-attention mechanisms, potentially enabling improved global context understanding. This paper presents a comparative assessment of Vision Transformers and CNN-based architectures for remote sensing land use scene classification. Representative CNN models, such as AlexNet, are evaluated alongside the Vision Transformer (ViT) using benchmark remote sensing datasets, including the…
