Remote Sensing Scene Classification with Masked Image Modeling (MIM)
Liya Wang, Alex Tien

TL;DR
This paper demonstrates that Masked Image Modeling (MIM) pretraining significantly improves remote sensing scene classification accuracy using Vision Transformers, outperforming supervised learning methods and rivaling specialized models.
Contribution
It is the first to systematically evaluate MIM pretraining for remote sensing scene classification, showing substantial performance gains over supervised learning and competitive results with specialized models.
Findings
MIM-pretrained ViTs outperform their supervised counterparts by up to 5% in top-1 accuracy.
MIM pretraining improves top-1 accuracy by up to 18%.
MIM-pretrained ViTs achieve performance comparable to specialized Transformer models.
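The gains above are stated in terms of top-1 accuracy. As a minimal sketch of how such a comparison can be computed, the snippet below scores two sets of predictions against the same labels. The function name `top1_accuracy`, the toy scores, and the reading of the gap as absolute percentage points are assumptions made for illustration; this summary does not say whether the reported gains are absolute or relative, and none of the numbers below come from the paper.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of samples whose highest-scoring class equals the true label."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and the same length")
    correct = 0
    for row, label in zip(scores, labels):
        # Argmax over the class scores of one sample.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Toy, made-up class scores for 4 images and 3 classes (hypothetical data).
labels = [0, 2, 1, 1]
scores_a = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.5, 0.3], [0.1, 0.8, 0.1]]
scores_b = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.6, 0.3, 0.1]]

acc_a = top1_accuracy(scores_a, labels)
acc_b = top1_accuracy(scores_b, labels)

# Assumption: the gap is expressed in absolute percentage points.
gap_points = 100.0 * (acc_a - acc_b)
print(f"model A: {acc_a:.2%}, model B: {acc_b:.2%}, gap: {gap_points:.1f} points")
```

Reporting the gap in percentage points keeps it comparable across datasets with different baseline accuracies; a relative improvement would instead divide the gap by the baseline accuracy.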
Abstract
Remote sensing scene classification has been extensively studied for its critical role in geological survey, oil exploration, traffic management, earthquake prediction, wildfire monitoring, and intelligence monitoring. In the past, machine learning (ML) methods for this task mainly used backbones pretrained with supervised learning (SL). As Masked Image Modeling (MIM), a self-supervised learning (SSL) technique, has been shown to be a better way of learning visual feature representations, it presents a new opportunity for improving ML performance on the scene classification task. This research aims to explore the potential of MIM-pretrained backbones on four well-known classification datasets: Merced, AID, NWPU-RESISC45, and Optimal-31. Compared to the published benchmarks, we show that the MIM-pretrained Vision Transformer (ViT) backbones outperform other…
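For readers unfamiliar with MIM, the core idea is to hide a random subset of image patches and train the backbone to predict the hidden content from the visible patches. The sketch below shows only the patch-selection step. The function name `random_patch_mask` and the values used (224-pixel images, 16-pixel patches, a 0.75 mask ratio) are illustrative assumptions, not settings taken from this paper, and the reconstruction network itself is omitted.

```python
import random
from typing import List, Tuple


def random_patch_mask(
    image_size: int, patch_size: int, mask_ratio: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (visible, masked) patch indices for a square image.

    Patches are numbered row by row over a grid of
    (image_size // patch_size) x (image_size // patch_size) cells.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")

    per_side = image_size // patch_size
    num_patches = per_side * per_side
    num_masked = int(num_patches * mask_ratio)

    # Shuffle all patch indices and hide the first num_masked of them.
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)

    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# Hypothetical settings chosen only for illustration.
visible, masked = random_patch_mask(image_size=224, patch_size=16, mask_ratio=0.75)
print(f"{len(visible)} visible patches, {len(masked)} masked patches")
```

During pretraining, a fresh mask would typically be drawn for every image and every step; the fixed seed here only makes the example reproducible.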
Taxonomy
Topics: Remote-Sensing Image Classification · Advanced Image and Video Retrieval Techniques · Remote Sensing and Land Use
Methods: Attention Is All You Need · Absolute Position Encodings · Label Smoothing · Softmax · Adam · Layer Normalization · Residual Connection · Dense Connections · Linear Layer · Dropout
