Multi-Modal Vision Transformers for Crop Mapping from Satellite Image Time Series

Theresa Follath; David Mickisch; Jan Hemmerling; Stefan Erasmi; Marcel Schwieder; Begüm Demir

arXiv:2406.16513·cs.CV·June 25, 2024·1 cites

TL;DR

This paper introduces multi-modal transformer architectures for crop mapping from satellite image time series (SITS) and reports significant improvements over state-of-the-art architectures that combine convolutional and self-attention components.

Contribution

The paper proposes multi-modal multi-temporal transformer models built on the Temporo-Spatial Vision Transformer (TSViT), investigating three fusion strategies: Early Fusion, Cross Attention Fusion, and Synchronized Class Token Fusion.

Findings

01 Significant accuracy improvements over state-of-the-art models.

02 Effective integration of multi-sensor satellite data.

03 Validation on crop mapping datasets shows robustness.

Abstract

Using images acquired by different satellite sensors has been shown to improve classification performance in the framework of crop mapping from satellite image time series (SITS). Existing state-of-the-art architectures use self-attention mechanisms to process the temporal dimension and convolutions for the spatial dimension of SITS. Motivated by the success of purely attention-based architectures in crop mapping from single-modal SITS, we introduce several multi-modal multi-temporal transformer-based architectures. Specifically, we investigate the effectiveness of Early Fusion, Cross Attention Fusion and Synchronized Class Token Fusion within the Temporo-Spatial Vision Transformer (TSViT). Experimental results demonstrate significant improvements over state-of-the-art architectures with both convolutional and self-attention components.
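To make two of the fusion ideas named in the abstract more concrete, the following is a minimal NumPy sketch of generic early fusion (channel concatenation) and generic single-head cross attention. It is not the authors' implementation: the function names, tensor shapes, band counts, and random projection matrices are all illustrative assumptions, and the sketch assumes both modalities are already co-registered on the same spatial grid and acquisition dates.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def early_fusion(mod_a, mod_b):
    """Generic early fusion: stack two modalities along the channel axis.

    mod_a: (T, H, W, C_a), mod_b: (T, H, W, C_b) -> (T, H, W, C_a + C_b)
    Assumes identical time steps and spatial grid (an assumption of this sketch).
    """
    return np.concatenate([mod_a, mod_b], axis=-1)

def cross_attention(queries, context, rng):
    """Generic single-head cross attention.

    Tokens of one modality (queries) attend to tokens of another (context).
    queries: (N, D), context: (M, D) -> output (N, D), weights (N, M)
    The random matrices stand in for learned projections.
    """
    d = queries.shape[-1]
    w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    weights = softmax(q @ k.T / np.sqrt(d))  # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)

# Hypothetical inputs: 6 dates, 4x4 pixels, 10 optical bands and 2 radar bands.
optical = rng.standard_normal((6, 4, 4, 10))
radar = rng.standard_normal((6, 4, 4, 2))
fused = early_fusion(optical, radar)

# Hypothetical token sequences of width 16 from two modality-specific encoders.
tokens_a = rng.standard_normal((6, 16))
tokens_b = rng.standard_normal((8, 16))
attended, weights = cross_attention(tokens_a, tokens_b, rng)

print(fused.shape, attended.shape, weights.shape)
```

In the early fusion case a single encoder sees all bands at once, whereas cross attention keeps separate token streams and lets one stream query the other; how the paper combines these within TSViT is described in the full text, not here.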

Taxonomy

Topics: Smart Agriculture and AI · Remote Sensing in Agriculture · Remote Sensing and Land Use

Methods: Attention Is All You Need · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer · Absolute Position Encodings