Heuristical Comparison of Vision Transformers Against Convolutional Neural Networks for Semantic Segmentation on Remote Sensing Imagery

Ashim Dahal, Saydul Akbar Murad, and Nick Rahimi

arXiv:2411.09101·cs.CV·May 19, 2025

TL;DR

This paper compares Vision Transformers and CNNs for semantic segmentation of remote sensing images, highlighting the impact of a weighted loss function and transfer learning on model performance and efficiency.

Contribution

The paper presents a heuristic analysis of ViT versus CNN models, emphasizing the effects of a weighted loss function and transfer learning on segmentation accuracy.

Findings

1. Weighted fused loss improves CNN performance significantly.

2. CNN with weighted loss outperforms ViT in segmentation metrics.

3. Trade-offs identified between model accuracy and inference time.

Abstract

Vision Transformers (ViT) have recently brought a new wave of research in the field of computer vision. These models have performed particularly well in image classification and segmentation. Research on semantic and instance segmentation has accelerated with the introduction of the new architecture, with over 80% of the top 20 benchmarks for the iSAID dataset based on either the ViT architecture or the attention mechanism behind its success. This paper focuses on the heuristic comparison of three key factors of using (or not using) ViT for semantic segmentation of remote sensing aerial images on the iSAID dataset. The experimental results observed during this research were analyzed based on three objectives. First, we studied the use of a weighted fused loss function to maximize the mean Intersection over Union (mIoU) score and Dice score while minimizing entropy or class…
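The abstract describes a weighted fused loss that targets mIoU and Dice while minimizing entropy, but the excerpt here is truncated before the exact formulation. The following is a minimal sketch of one common way to build such a loss, not the authors' implementation: the function name `fused_loss`, the three scalar weights, and the choice of soft (probability-based) Dice and IoU terms are all assumptions made for illustration. NumPy is used to keep the example dependency-light, although the linked repository is tagged as PyTorch.

```python
import numpy as np

def softmax(logits, axis=1):
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def fused_loss(logits, target, num_classes, weights=(1.0, 1.0, 1.0), eps=1e-7):
    """Hypothetical weighted sum of cross-entropy, soft Dice loss, and soft IoU loss.

    logits: float array of shape (N, C, H, W)
    target: int array of shape (N, H, W) with labels in [0, num_classes)
    weights: (w_ce, w_dice, w_iou), assumed scalar weights for each term
    """
    probs = softmax(logits, axis=1)
    one_hot = np.eye(num_classes)[target]      # (N, H, W, C)
    one_hot = np.moveaxis(one_hot, -1, 1)      # (N, C, H, W)

    # Cross-entropy averaged over all pixels.
    log_probs = np.log(np.clip(probs, eps, 1.0))
    ce = -np.mean(np.sum(one_hot * log_probs, axis=1))

    # Per-class soft overlap statistics, summed over batch and spatial dims.
    inter = np.sum(probs * one_hot, axis=(0, 2, 3))
    pred_sum = np.sum(probs, axis=(0, 2, 3))
    true_sum = np.sum(one_hot, axis=(0, 2, 3))

    dice = (2.0 * inter + eps) / (pred_sum + true_sum + eps)
    iou = (inter + eps) / (pred_sum + true_sum - inter + eps)

    w_ce, w_dice, w_iou = weights
    # Minimizing (1 - Dice) and (1 - IoU) pushes both overlap scores upward.
    return w_ce * ce + w_dice * (1.0 - dice.mean()) + w_iou * (1.0 - iou.mean())

# Tiny usage example: one 2x2 image with three classes.
target = np.array([[[0, 1], [2, 0]]])
good_logits = 20.0 * np.moveaxis(np.eye(3)[target], -1, 1)
bad_logits = 20.0 * np.moveaxis(np.eye(3)[(target + 1) % 3], -1, 1)

loss_good = fused_loss(good_logits, target, num_classes=3)
loss_bad = fused_loss(bad_logits, target, num_classes=3)
```

In this sketch a confident correct prediction drives all three terms toward zero, while a confident wrong prediction is penalized by every term. Raising `w_dice` or `w_iou` relative to `w_ce` shifts the optimization toward region overlap rather than per-pixel likelihood, which is one plausible reading of "weighted" in the abstract; per-class weighting would be another.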

Code & Models

Repositories

ashimdahal/vit-vs-cnn-image-segmentation (PyTorch, official)

Taxonomy

Topics: Remote-Sensing Image Classification

Methods: Softmax · Attention Is All You Need