Vision Transformers and Convolutional Neural Networks for Land Use Scene Classification

Arun D. Kulkarni

arXiv:2605.21268·cs.CV·May 21, 2026

TL;DR

This paper compares Vision Transformers and CNNs for land use scene classification in remote sensing, highlighting their respective strengths, limitations, and suitability depending on data size and scene complexity.

Contribution

It provides a comprehensive performance comparison of ViTs and CNNs on benchmark datasets, offering insights into their advantages and limitations for remote sensing applications.

Findings

01. CNNs perform well with limited data and on local textures

02. ViTs excel at capturing global relationships when sufficient data is available

03. ViTs require more computational resources and larger datasets
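
The contrast in these findings can be made concrete with a minimal sketch. It is not the paper's implementation: the function names (`self_attention`, `local_average`, `num_patches`) are hypothetical, the attention uses identity query/key/value projections instead of learned ones, and a windowed average stands in for a learned convolution kernel. The point is only that an attention output mixes every token, while a convolution-like output sees just its neighbors.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention.

    Assumption: identity Q/K/V projections (a real ViT learns these).
    Returns (outputs, weights); weights[i][j] is how much token i attends to token j.
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * tokens[j][c] for j in range(len(tokens))) for c in range(d)]
        )
    return outputs, weights


def local_average(tokens, radius=1):
    """Stand-in for a convolution: output i depends only on tokens within `radius` of i."""
    n, d = len(tokens), len(tokens[0])
    outputs = []
    for i in range(n):
        window = tokens[max(0, i - radius): min(n, i + radius + 1)]
        outputs.append([sum(t[c] for t in window) / len(window) for c in range(d)])
    return outputs


def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches (tokens) in a square image."""
    side = image_size // patch_size
    return side * side
```

Under the assumed (not source-stated) setting of a 224-pixel image and 16-pixel patches, `num_patches(224, 16)` gives 196 tokens, so one attention map holds 196 × 196 = 38,416 pairwise weights. That quadratic growth in the token count is one reason for the higher compute cost noted in finding 03.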

Abstract

Land Use Scene Classification (LUSC) from remote sensing imagery plays a critical role in environmental monitoring, urban planning, and sustainable resource management. In recent years, deep learning methods have significantly advanced the state of the art, with Convolutional Neural Networks (CNNs) dominating the field because of their strong ability to capture local spatial features. However, the emergence of Vision Transformers (ViTs) has introduced a new paradigm that models long-range dependencies through self-attention mechanisms, potentially enabling improved global context understanding. This paper presents a comparative assessment of Vision Transformers and CNN-based architectures for remote sensing land use scene classification. Representative CNN models, such as AlexNet, are evaluated alongside the Vision Transformer (ViT) using benchmark remote sensing datasets, including the…
