Efficient Spatio-Temporal Vegetation Pixel Classification with Vision Transformers

Alan Gomes; Anderson Gonçalves; Samuel Felipe dos Santos; Nathan Felipe Alves; Magna Soelma Beserra de Moura; Bruna de Costa Alberton; Leonor Patricia C. Morellato; Ricardo da Silva Torres; Jurandy Almeida

arXiv:2605.00296·cs.CV·May 4, 2026

TL;DR

This paper optimizes Vision Transformers for efficient plant pixel classification over time, significantly reducing computational costs while maintaining accuracy in high-resolution vegetation monitoring.

Contribution

It provides a comprehensive ablation study on ViT design choices and demonstrates their effectiveness for scalable, resource-efficient spatio-temporal vegetation analysis.

Findings

01. ViT reduces FLOPs by an order of magnitude compared to CNNs.
02. ViT maintains constant parameter complexity regardless of time series length.
03. Experimental results show competitive classification performance on Brazilian Cerrado datasets.
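
The second finding can be illustrated with a rough parameter count. The sketch below is not the paper's architecture: the layer sizes, the function names `multibranch_cnn_params` and `shared_token_vit_params`, and the choice of fixed (non-learned) positional encodings are all assumptions made for illustration. It contrasts a hypothetical multi-branch CNN that keeps one convolutional branch per time step with a hypothetical ViT that treats each time step as a token passed through one shared embedding.

```python
def multibranch_cnn_params(num_steps, in_ch=3, hidden=64, kernel=3, num_classes=6):
    """Parameter count of a hypothetical multi-branch CNN (illustrative sizes).

    Each time step gets its own convolutional branch with separate weights,
    and the classifier sees the concatenation of all branch outputs.
    """
    branch = (kernel * kernel * in_ch + 1) * hidden        # conv weights + bias
    classifier = (hidden * num_steps + 1) * num_classes    # linear layer on concatenated features
    return num_steps * branch + classifier


def shared_token_vit_params(num_steps, in_ch=3, patch=9, dim=64, depth=2,
                            mlp_ratio=4, num_classes=6):
    """Parameter count of a hypothetical ViT with one token per time step.

    Assumes fixed (non-learned) positional encodings, so num_steps does not
    appear anywhere in the count: every token reuses the same weights.
    """
    embed = (patch * patch * in_ch + 1) * dim              # shared linear patch embedding
    attn = 4 * (dim * dim + dim)                           # Q, K, V and output projections
    hidden = mlp_ratio * dim
    mlp = (dim * hidden + hidden) + (hidden * dim + dim)   # two-layer feed-forward
    norms = 2 * 2 * dim                                    # two LayerNorms (scale + shift)
    head = (dim + 1) * num_classes                         # classification head
    return embed + depth * (attn + mlp + norms) + head


if __name__ == "__main__":
    for steps in (5, 10, 20, 40):
        print(steps, multibranch_cnn_params(steps), shared_token_vit_params(steps))
```

Under these assumptions the CNN count grows linearly with the number of time steps while the ViT count stays fixed. A learned positional embedding would add `num_steps * dim` parameters, which is small relative to the transformer blocks but not strictly constant.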

Abstract

Plant phenology, the study of recurrent life cycle events, is essential for understanding ecosystem dynamics and their responses to climate change. While Unmanned Aerial Vehicles (UAVs) and near-surface cameras enable high-resolution monitoring, identifying plant species across time remains computationally challenging. State-of-the-art approaches, specifically Multi-Temporal Convolutional Networks (CNNs), rely on rigid multi-branch architectures that scale poorly with longer time series and require large spatial context windows. In this paper, we present an extensive study on optimizing Vision Transformers (ViTs) for efficient spatio-temporal vegetation pixel classification. We conducted a comprehensive ablation study analyzing seven key design dimensions, including (i) data normalization; (ii) spectral arrangement; (iii) boundary handling; (iv) spatial context window shape and…
