Deepfake Geography: Detecting AI-Generated Satellite Images
Mansur Yerzhanuly

TL;DR
This paper compares CNNs and Vision Transformers (ViTs) for detecting AI-generated satellite images. It reports that ViTs are more accurate and more robust than CNNs, and it applies interpretability methods to make detection decisions easier to trust.
Contribution
It provides a comprehensive evaluation of ViTs versus CNNs for satellite deepfake detection, introducing interpretability methods specific to each architecture.
Findings
ViTs achieve 95.11% accuracy, outperforming CNNs at 87.02%.
ViTs are more robust than CNNs at detecting synthetic imagery.
Interpretability methods reveal distinct detection behaviors.
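To make the headline numbers concrete, here is a minimal sketch of how accuracy is typically computed for a binary real-versus-synthetic classifier, and how a gap between two models is expressed in percentage points. The labels and predictions are hypothetical toy data, not results from the paper, and the function name `accuracy` is an illustrative choice. Only the two reported accuracies (95.11% and 87.02%) come from the source.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], predictions: Sequence[int]) -> float:
    """Return the fraction of images whose predicted class matches the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Hypothetical toy data: 1 = AI-generated, 0 = authentic.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
vit_preds = [1, 0, 1, 1, 0, 0, 1, 1]  # stand-in for a ViT's outputs
cnn_preds = [1, 0, 0, 1, 0, 1, 1, 1]  # stand-in for a CNN's outputs

vit_acc = accuracy(labels, vit_preds)
cnn_acc = accuracy(labels, cnn_preds)
toy_gap_points = (vit_acc - cnn_acc) * 100  # gap in percentage points

# Gap between the accuracies reported in the source (95.11% vs. 87.02%).
reported_gap_points = round(95.11 - 87.02, 2)

print(f"toy ViT accuracy: {vit_acc:.3f}, toy CNN accuracy: {cnn_acc:.3f}")
print(f"toy gap: {toy_gap_points:.1f} points")
print(f"reported gap: {reported_gap_points} points")
```

Accuracy alone does not capture the robustness finding, which would need evaluation under perturbed or shifted inputs. The sketch covers only the accuracy comparison.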
Abstract
The rapid advancement of generative models such as StyleGAN2 and Stable Diffusion poses a growing threat to the authenticity of satellite imagery, which is increasingly vital for reliable analysis and decision-making across scientific and security domains. While deepfake detection has been extensively studied in facial contexts, satellite imagery presents distinct challenges, including terrain-level inconsistencies and structural artifacts. In this study, we conduct a comprehensive comparison between Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for detecting AI-generated satellite images. Using a curated dataset of over 130,000 labeled RGB images from the DM-AER and FSI datasets, we show that ViTs significantly outperform CNNs in both accuracy (95.11 percent vs. 87.02 percent) and overall robustness, owing to their ability to model long-range dependencies and…
Taxonomy
Topics: Remote-Sensing Image Classification · Advanced Neural Network Applications · Face recognition and analysis
