A Comparative Study of Transformer and Convolutional Models for Crop Segmentation from Satellite Image Time Series

Mattia Gatti; Ignazio Gallo; Nicola Landro; Christian Loschiavo; Anwar Ur Rehman; Mirco Boschetti; Riccardo La Grassa

arXiv:2412.01944 · cs.CV · May 21, 2026

TL;DR

This study compares three convolutional and three transformer-based models for crop segmentation from satellite image time series, evaluating them on the Munich and Lombardia Sentinel-2 datasets and highlighting the importance of temporal modeling.

Contribution

The paper provides a side-by-side comparison of CNN and transformer-based models for crop segmentation, reporting on both their effectiveness and their efficiency when processing satellite time series data.

Findings

01

TSViT achieves the best overall results, slightly surpassing 3D U-Net.

02

VistaFormer offers the best efficiency among tested models.

03

Temporal modeling is critical for effective crop segmentation from satellite data.

Abstract

Crop segmentation from satellite image time series (SITS) is a fundamental task for agricultural monitoring and land-use analysis. While convolutional neural networks (CNNs) have been widely used, transformer-based architectures offer alternative mechanisms for representing spatial and temporal dependencies in multispectral data. This paper presents a comparative study of CNN and transformer-based segmentation models for crop mapping from Sentinel-2 time series, including 3D U-Net, 3D FPN, 3D DeepLabv3, and three transformer architectures: Swin UNETR, TSViT, and VistaFormer, which adopt different strategies for capturing temporal dependencies. Experiments on the Munich and Lombardia datasets show that TSViT achieves the best overall results, slightly surpassing 3D U-Net, which remains a strong CNN baseline. VistaFormer offers the best efficiency, while Swin UNETR performs competitively…
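
To illustrate why temporal modeling matters for this task, here is a minimal toy sketch in plain Python. It is not taken from the paper and does not reproduce any of the compared models. The two vegetation-index curves are invented values for two hypothetical crops that green up at different times. Collapsing the time axis to a mean makes them indistinguishable, while comparing the full series keeps them apart.

```python
# Toy illustration only: values are invented, not drawn from the Munich or
# Lombardia datasets, and the helper names below are hypothetical.

# Vegetation-index-like curves over six acquisition dates for one pixel each.
early_crop = [0.2, 0.6, 0.8, 0.6, 0.3, 0.2]  # peaks early in the season
late_crop = [0.2, 0.3, 0.6, 0.8, 0.6, 0.2]   # same values, peaks later


def temporal_mean(series):
    """Collapse the time axis to a single value (discards ordering)."""
    return sum(series) / len(series)


def squared_distance(a, b):
    """Compare two series date by date (keeps ordering)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Gap between the two crops once the time axis has been averaged away.
mean_gap = abs(temporal_mean(early_crop) - temporal_mean(late_crop))

# Gap between the two crops when the full time series is retained.
series_gap = squared_distance(early_crop, late_crop)

print(f"gap after temporal averaging: {mean_gap:.3f}")
print(f"gap using the full series:    {series_gap:.3f}")
```

The averaged representation gives a gap of essentially zero because the second curve is a reordering of the first, whereas the date-by-date comparison stays clearly positive. The architectures in the study use far richer mechanisms (3D convolutions, attention over time), but the sketch shows the kind of information that is lost when acquisition order is ignored.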
