# A satellite based machine learning approach for estimating high resolution daily average air temperature in a megacity in Brazil

**Authors:** Aina Roca-Barceló, Rochelle Schneider, Monica Pirani, Alessandro Sebastianelli, Frédéric B. Piel, Paolo Vineis, Adelaide Cassia Nardocci, Daniela Fecht

PMC · DOI: 10.1038/s41598-026-35689-x · Scientific Reports · 2026-02-05

## TL;DR

This paper presents a machine learning method to estimate high-resolution daily temperatures in São Paulo, Brazil, using satellite data and ground stations to improve environmental health studies.

## Contribution

A generalizable tree-based machine learning approach for high-resolution temperature estimation in urban areas with sparse monitoring.

## Key findings

- The Random Forest model (RMSE = 0.80; R² = 0.95) outperformed a traditional multilinear regression approach (RMSE = 1.02; R² = 0.92).
- The model performed slightly worse in rural areas (R² = 0.91) than in urban areas (R² = 0.95).
- The resulting 500 × 500 m temperature dataset is the first of its kind in South America.
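
As a reading aid for the metrics above, here is a minimal sketch of how RMSE and R² are conventionally computed from paired observed and predicted values. It assumes the standard definitions (R² as 1 minus the ratio of residual to total sum of squares); this summary does not state which R² variant or which units the authors used, and the sample temperatures are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error, in the same units as the inputs."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observations are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical daily mean temperatures (assumed to be in degrees Celsius),
# not values taken from the paper.
observed = [18.0, 20.0, 22.0, 24.0, 26.0]
predicted = [18.5, 19.5, 22.5, 23.5, 26.5]

print(f"RMSE = {rmse(observed, predicted):.2f}")
print(f"R^2  = {r_squared(observed, predicted):.3f}")
```

A lower RMSE means smaller typical errors, and an R² closer to 1 means the predictions explain more of the variance in the observations.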

## Abstract

Spatiotemporally resolved ambient temperature data are essential for environmental epidemiology, especially in urban areas where temperature can vary sharply over short distances, influencing population exposure. Additionally, heat distribution often reflects built environment patterns and may correlate with existing social and environmental disparities. Continuous temporal records at high spatial resolution are, however, often lacking, especially in low- and middle-income countries. We developed a generalizable tree-based machine learning approach to estimate daily mean temperatures at 500 × 500 m resolution using São Paulo, a megacity in Brazil, as a case study, to demonstrate its utility in highly urbanized settings with a heterogeneous urban fabric and unevenly distributed temperature monitoring stations. We trained a Random Forest model using open-access remote sensing data, along with derived products, and temperature measurements from 43 ground stations. To prevent overfitting and select relevant features, we employed a forward feature selection algorithm with target-oriented (spatial) cross-validation. Hyperparameter tuning was performed using a grid search approach. The model was validated through ten-fold station-based cross-validation and an external hold-out dataset. The model demonstrated strong performance (RMSE = 0.80; R² = 0.95), with slightly reduced accuracy in rural areas (R² = 0.91) relative to urban areas (R² = 0.95). The Random Forest model outperformed traditional multilinear approaches (RMSE = 1.02; R² = 0.92), likely due to its ability to better capture microclimates and complex relationships between data sources. This 500 × 500 m daily temperature dataset is the first of its kind in South America, with the São Paulo pipeline and data freely accessible. The approach is adaptable to other regions with appropriate retraining and validation, enabling high-resolution exposure assessments.
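
The station-based cross-validation mentioned in the abstract can be illustrated with a short sketch. This is not the authors' pipeline: the function name `station_kfold`, the station labels, and the three records per station are hypothetical. It shows only the core idea that all records from one station are held out together, so the model is always tested at locations it has not seen.

```python
import random
from collections import defaultdict


def station_kfold(station_ids, n_splits=10, seed=0):
    """Yield (train_idx, test_idx) pairs in which each station's rows
    appear in the test set of exactly one fold."""
    stations = sorted(set(station_ids))
    if n_splits < 2 or n_splits > len(stations):
        raise ValueError("n_splits must be between 2 and the number of stations")

    # Shuffle stations reproducibly, then deal them round-robin into folds.
    rng = random.Random(seed)
    rng.shuffle(stations)
    fold_of = {station: i % n_splits for i, station in enumerate(stations)}

    # Group row indices by the fold their station was assigned to.
    rows_by_fold = defaultdict(list)
    for row, station in enumerate(station_ids):
        rows_by_fold[fold_of[station]].append(row)

    for fold in range(n_splits):
        test_idx = rows_by_fold[fold]
        train_idx = [
            row
            for other in range(n_splits)
            if other != fold
            for row in rows_by_fold[other]
        ]
        yield train_idx, test_idx


# Toy input: 43 hypothetical stations with 3 daily records each.
station_ids = [f"station_{i:02d}" for i in range(43) for _ in range(3)]
folds = list(station_kfold(station_ids, n_splits=10))

for k, (train_idx, test_idx) in enumerate(folds):
    held_out = sorted({station_ids[i] for i in test_idx})
    print(f"fold {k}: {len(train_idx)} train rows, {len(held_out)} held-out stations")
```

Splitting by station rather than by row avoids the optimistic scores that arise when records from the same site appear in both training and test sets. In practice, a grouped splitter such as scikit-learn's `GroupKFold` serves a similar purpose and can be passed to a grid search over model hyperparameters.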

The online version contains supplementary material available at 10.1038/s41598-026-35689-x.

## Full-text entities

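- **Abbreviations:** LMIC (low- and middle-income countries), FFS (forward feature selection), LST (land surface temperature), NDVI (normalized difference vegetation index), ECMWF (European Centre for Medium-Range Weather Forecasts)
- **Chemicals:** water (MESH:D014867)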
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12929572/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12929572/full.md

## References

16 references — full list in the complete paper: https://tomesphere.com/paper/PMC12929572/full.md

---
Source: https://tomesphere.com/paper/PMC12929572