Lightweight, Pre-trained Transformers for Remote Sensing Timeseries
Gabriel Tseng, Ruben Cartuyvels, Ivan Zvonkov, Mirali Purohit, David Rolnick, Hannah Kerner

TL;DR
Presto is a lightweight, pre-trained transformer model tailored for remote sensing time series data, enabling effective transfer learning and feature extraction with fewer computational resources.
Contribution
The paper introduces Presto, a novel pre-trained transformer specifically designed for remote sensing time series data, improving efficiency and performance over larger models.
Findings
Presto performs competitively on various remote sensing tasks.
Presto requires significantly less compute than larger models.
Presto is effective for transfer learning and feature extraction.
Abstract
Machine learning methods for satellite data have a range of societally relevant applications, but labels used to train models can be difficult or impossible to acquire. Self-supervision is a natural solution in settings with limited labeled data, but current self-supervised models for satellite data fail to take advantage of the characteristics of that data, including the temporal dimension (which is critical for many applications, such as monitoring crop growth) and availability of data from many complementary sensors (which can significantly improve a model's predictive performance). We present Presto (the Pretrained Remote Sensing Transformer), a model pre-trained on remote sensing pixel-timeseries data. By designing Presto specifically for remote sensing data, we can create a significantly smaller but performant model. Presto excels at a wide variety of globally distributed remote…
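As a rough illustration of self-supervision on pixel-timeseries data, the sketch below hides a random subset of (timestep, band) values for a single pixel and scores a reconstruction only on the hidden values, in the general style of masked autoencoding. It is not Presto's actual masking strategy or training code: the function names, the `mask_ratio` default, and the zero fill value are all assumptions made for this example.

```python
import random


def mask_pixel_timeseries(series, mask_ratio=0.75, seed=None):
    """Randomly hide (timestep, band) entries of one pixel's timeseries.

    series: list of timesteps, each a list of per-band values.
    Returns (masked_series, mask); mask[t][b] is True for hidden entries,
    and hidden values are replaced by 0.0 (an assumed placeholder).
    """
    rng = random.Random(seed)
    n_steps, n_bands = len(series), len(series[0])
    positions = [(t, b) for t in range(n_steps) for b in range(n_bands)]
    n_masked = int(round(mask_ratio * len(positions)))
    hidden = set(rng.sample(positions, n_masked))
    mask = [[(t, b) in hidden for b in range(n_bands)] for t in range(n_steps)]
    masked = [
        [0.0 if mask[t][b] else series[t][b] for b in range(n_bands)]
        for t in range(n_steps)
    ]
    return masked, mask


def masked_mse(prediction, target, mask):
    """Mean squared error computed only over the hidden entries."""
    errors = [
        (prediction[t][b] - target[t][b]) ** 2
        for t in range(len(target))
        for b in range(len(target[0]))
        if mask[t][b]
    ]
    return sum(errors) / len(errors) if errors else 0.0


if __name__ == "__main__":
    # Hypothetical input: 12 monthly observations of 3 bands for one pixel.
    series = [[float(t + b) for b in range(3)] for t in range(12)]
    masked, mask = mask_pixel_timeseries(series, mask_ratio=0.5, seed=0)
    print(sum(sum(row) for row in mask), "of 36 values hidden")
    # A perfect reconstruction has zero loss on the hidden values.
    print(masked_mse(series, series, mask))
```

In a real pre-training loop, the prediction would come from an encoder-decoder model that sees only the visible values; here the loss is shown on its own to keep the example self-contained.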
Peer Reviews
Decision·Submitted to ICLR 2024
* The article is well written, technically sound, and easy to follow.
* **Reproducibility**. Code has been made available and it is easy to reproduce results. This is also very important to motivate adoption by the community.
* **Generalizability**. Presto works well for multiple application tasks and was tested in multiple geographic locations.
* **Resolution constraints**. Most downstream tasks Presto is tested on use coarse resolution. It would be nice to test the effect of encoding imagery of different resolutions during pretraining.
* **Small performance gains**. The performance improvement over simple baselines is limited for some of the tasks tested.
* **Set of downstream tasks tested**. It is worrisome to me that most self-supervised methods test their proposed approach against a different set of downstream tasks, making it har
The paper deals with an important issue: model pretraining using remote sensing data. For remote sensing applications of machine learning, unlabelled data is plentiful directly from satellites, while obtaining ground truth can be challenging. As a result, labelled datasets for remote sensing are typically small, and taking advantage of large-scale unlabelled data through pretraining could bring significant benefits. As discussed, the masked autoencoder framework is by design suitable
I believe there are several issues with this paper that need to be addressed by the authors. 1) The application of Transformers to remote sensing should be discussed in more detail, e.g. [1] for pixel timeseries classification and [2] image timeseries classification and segmentation (despite operating on image timeseries base their model design on similar points as bullet points 1, 2 mentioned in the introduction). [1] https://www.sciencedirect.com/science/article/pii/S0924271620301647 [2] htt
1. Comprehensive Pretraining Data: The paper mentions that Presto is pre-trained on a diverse range of directly sensed and derived Earth observation products, which can significantly improve model performance. This comprehensive pretraining data helps the model capture a wide range of features and patterns in remote sensing data. 2. Competitive Performance: The paper highlights that Presto achieves state-of-the-art results in a wide variety of globally distributed evaluation tasks. It outperfor
1. Limited Spatial Context: Presto is designed to process pixel-time series data and does not process very high-resolution imagery natively. This limitation may impact its performance on tasks where spatial information is crucial, such as scene classification challenges. Image-based models that can distinguish the shape of relevant pixels may be better suited for such tasks. 2. Lossy Aggregation of Spatial Information: The paper mentions that Presto uses a crude token-aggregation method to repr
Taxonomy
Topics: Distributed and Parallel Computing Systems · Environmental Monitoring and Data Management · Computational Physics and Python Applications
Methods: Temporal Dropout or TempD · Band Dropout · Attention Is All You Need · Linear Layer · Adam · Layer Normalization · Dense Connections · Label Smoothing · Dropout · Absolute Position Encodings
