Onboard Satellite Image Classification for Earth Observation: A Comparative Study of ViT Models

Thanh-Dung Le; Vu Nguyen Ha; Ti Ti Nguyen; Geoffrey Eappen; Prabhu Thiruvasagam; Hong-fu Chou; Duc-Dung Tran; Hung Nguyen-Kha; Luis M. Garces-Socarras; Jorge L. Gonzalez-Rios; Juan Carlos Merlano-Duncan; Symeon Chatzinotas

arXiv:2409.03901·cs.CV·April 23, 2025

TL;DR

This paper compares various pre-trained vision Transformer models for onboard satellite land use classification, finding EfficientViT-M2 to be the most accurate, efficient, and robust model suitable for satellite Earth observation tasks.

Contribution

It provides a comprehensive comparison of ViT models for onboard satellite image classification, highlighting EfficientViT-M2 as the optimal choice for accuracy and energy efficiency.

Findings

01

EfficientViT-M2 achieves 98.76% accuracy, precision, and recall.

02

EfficientViT-M2 reduces power consumption by over 63% compared to other models.

03

Pre-trained ViT models outperform traditional CNN and ResNet models in this context.
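
The first finding reports accuracy, precision, and recall together. As a rough illustration of how these three numbers are typically derived from predictions, here is a minimal sketch using only the Python standard library. The function name `macro_metrics`, the toy labels, and the choice of macro averaging are assumptions made for this example; the page does not state which averaging scheme the paper uses.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall) for two label lists.

    Hypothetical helper for illustration; macro averaging is an assumption,
    not something this page attributes to the paper.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    def safe_div(num, den):
        return num / den if den else 0.0

    accuracy = safe_div(sum(tp.values()), len(y_true))
    precision = sum(safe_div(tp[c], tp[c] + fp[c]) for c in labels) / len(labels)
    recall = sum(safe_div(tp[c], tp[c] + fn[c]) for c in labels) / len(labels)
    return accuracy, precision, recall


# Toy land-use labels, invented for this example.
y_true = ["forest", "forest", "river", "river"]
y_pred = ["forest", "river", "river", "river"]
accuracy, precision, recall = macro_metrics(y_true, y_pred)
```

On this toy input, three of four predictions are correct, so accuracy is 0.75, while the per-class precision and recall differ between "forest" and "river" before being averaged.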

Abstract

This study focuses on identifying the most effective pre-trained model for land use classification in onboard satellite processing, emphasizing high accuracy, computational efficiency, and robustness against the noisy data conditions commonly encountered during satellite-based inference. Through extensive experimentation, we compare the performance of traditional CNN-based, ResNet-based, and various pre-trained Vision Transformer (ViT) models. Our findings demonstrate that pre-trained ViT models, particularly MobileViTV2 and EfficientViT-M2, outperform models trained from scratch in terms of accuracy and efficiency. These models achieve high performance with reduced computational requirements and exhibit greater resilience during inference under noisy conditions. While MobileViTV2 has excelled on clean validation data, EfficientViT-M2 has proved more robust when…
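
The abstract refers to robustness during inference under noisy conditions. One common way to probe this, sketched below, is to perturb the input images and measure accuracy at increasing noise levels. This is only an illustration under stated assumptions: additive Gaussian noise is assumed (the page does not say which noise model the paper uses), and `add_gaussian_noise`, `accuracy_under_noise`, and `toy_classify` are hypothetical names, with `toy_classify` standing in for a real model.

```python
import random


def add_gaussian_noise(image, sigma, rng):
    """Return a noisy copy of `image` (rows of floats in [0, 1]).

    Assumption: zero-mean additive Gaussian noise, clipped back to [0, 1].
    """
    return [
        [min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        for row in image
    ]


def accuracy_under_noise(classify, images, labels, sigma, seed=0):
    """Fraction of noisy images that `classify` still labels correctly."""
    rng = random.Random(seed)  # fixed seed keeps the evaluation repeatable
    correct = sum(
        classify(add_gaussian_noise(img, sigma, rng)) == label
        for img, label in zip(images, labels)
    )
    return correct / len(labels)


def toy_classify(image):
    """Hypothetical stand-in for a model: thresholds mean brightness."""
    total = sum(sum(row) for row in image)
    count = sum(len(row) for row in image)
    return "bright" if total / count > 0.5 else "dark"


# Invented 2x2 grayscale images, used only to exercise the functions.
images = [[[0.9, 0.8], [0.7, 0.9]], [[0.1, 0.2], [0.0, 0.1]]]
labels = ["bright", "dark"]
for sigma in (0.0, 0.1, 0.5):
    print(sigma, accuracy_under_noise(toy_classify, images, labels, sigma))
```

In practice `toy_classify` would be replaced by the model under test, and the resulting accuracy-versus-sigma curves would be compared across models.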

Code & Models

Repositories

ltdung/snt-sentry (Official)

Taxonomy

Topics: Remote Sensing and Land Use · Remote-Sensing Image Classification · Advanced Computational Techniques and Applications

Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Vision Transformer · Softmax · Label Smoothing · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Linear Layer