Multi-Modal Vision Transformers for Crop Mapping from Satellite Image Time Series
Theresa Follath, David Mickisch, Jan Hemmerling, Stefan Erasmi, Marcel Schwieder, Begüm Demir

TL;DR
This paper introduces multi-modal transformer architectures for crop mapping from satellite image time series, demonstrating significant improvements over existing methods by effectively integrating multi-sensor data.
Contribution
The paper proposes novel multi-modal multi-temporal transformer models, including Early Fusion, Cross Attention Fusion, and Synchronized Class Token Fusion, advancing crop classification accuracy.
Findings
Significant accuracy improvements over state-of-the-art architectures that combine convolutional and self-attention components.
Effective integration of multi-sensor satellite data.
Validation on crop mapping datasets shows robustness.
Abstract
Using images acquired by different satellite sensors has been shown to improve classification performance in the framework of crop mapping from satellite image time series (SITS). Existing state-of-the-art architectures use self-attention mechanisms to process the temporal dimension and convolutions for the spatial dimension of SITS. Motivated by the success of purely attention-based architectures in crop mapping from single-modal SITS, we introduce several multi-modal multi-temporal transformer-based architectures. Specifically, we investigate the effectiveness of Early Fusion, Cross Attention Fusion, and Synchronized Class Token Fusion within the Temporo-Spatial Vision Transformer (TSViT). Experimental results demonstrate significant improvements over state-of-the-art architectures with both convolutional and self-attention components.
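To make two of the fusion strategies named in the abstract more concrete, here is a minimal, generic sketch in plain Python. It is not the authors' implementation and does not reproduce TSViT. It assumes that early fusion means concatenating the per-date feature vectors of two temporally aligned modalities before tokenization, and that cross attention fusion means tokens of one modality attending over tokens of the other via scaled dot-product attention. The learned query, key, and value projections of a real transformer are omitted, and all names (`early_fusion`, `cross_attention`, `modality_a`, `modality_b_aligned`, `modality_b_unaligned`) and input values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def early_fusion(series_a, series_b):
    """Assumed early fusion: concatenate the feature vectors of two
    modalities date by date, so one encoder sees a single fused series.
    Requires both series to have the same number of time steps."""
    if len(series_a) != len(series_b):
        raise ValueError("this early-fusion sketch assumes aligned time steps")
    return [a + b for a, b in zip(series_a, series_b)]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product attention without learned projections:
    each query token (modality A) is replaced by a weighted average of
    the value tokens (modality B)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return out


# Toy inputs: 3 acquisition dates with 2 features each (values are made up).
modality_a = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
modality_b_aligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# A second series with 4 dates, i.e. not aligned with modality_a.
modality_b_unaligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

# Early fusion: 3 tokens, each with 2 + 2 = 4 features.
fused = early_fusion(modality_a, modality_b_aligned)

# Cross attention: 3 output tokens (one per query), each with 2 features.
attended = cross_attention(modality_a, modality_b_unaligned, modality_b_unaligned)
```

One design difference the sketch highlights: concatenation needs both series to share the same acquisition dates, whereas attention-based fusion can relate series of different lengths because the output has one token per query regardless of how many key/value tokens exist.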
Taxonomy
Topics: Smart Agriculture and AI · Remote Sensing in Agriculture · Remote Sensing and Land Use
Methods: Attention Is All You Need · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer · Absolute Position Encodings
