Vision Transformers for Multi-Variable Climate Downscaling: Emulating Regional Climate Models with a Shared Encoder and Multi-Decoder Architecture
Fabio Merizzi, Harilaos Loukos

TL;DR
This paper introduces a multi-variable Vision Transformer architecture for regional climate downscaling that improves accuracy and reduces computational costs by jointly modeling six climate variables from GCM data.
Contribution
The paper presents a multi-variable ViT with a shared encoder and variable-specific decoders (1EMD), which outperforms single-variable models and alternative multi-variable baselines on climate downscaling tasks.
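The shared-encoder, multi-decoder idea can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the patch size, embedding width, upscaling factor, single attention layer, linear decoder heads, and the omission of positional embeddings are all assumptions made for brevity. Only three of the six variable names appear in the abstract excerpt, so the remaining three are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(x, p):
    """Split a (C, H, W) field into (N, C*p*p) non-overlapping patch tokens."""
    c, h, w = x.shape
    x = x.reshape(c, h // p, p, w // p, p).transpose(1, 3, 0, 2, 4)
    return x.reshape((h // p) * (w // p), c * p * p)

def self_attention(tokens, wq, wk, wv):
    """Single-head self-attention with a residual connection."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return tokens + weights @ v

class SharedEncoderMultiDecoder:
    """Hypothetical toy model: one encoder, one linear decoder head per variable."""

    def __init__(self, variables, in_channels, patch, dim, scale):
        self.patch, self.scale = patch, scale
        init = lambda *shape: rng.normal(0.0, 0.02, size=shape)
        self.embed = init(in_channels * patch * patch, dim)
        self.wq, self.wk, self.wv = init(dim, dim), init(dim, dim), init(dim, dim)
        out_patch = patch * scale  # each token decodes to a finer-resolution patch
        self.heads = {v: init(dim, out_patch * out_patch) for v in variables}

    def encode(self, x):
        tokens = patchify(x, self.patch) @ self.embed
        return self_attention(tokens, self.wq, self.wk, self.wv)

    def forward(self, x):
        _, h, w = x.shape
        gh, gw = h // self.patch, w // self.patch
        op = self.patch * self.scale
        latent = self.encode(x)  # computed once, reused by every decoder head
        outputs = {}
        for name, head in self.heads.items():
            patches = (latent @ head).reshape(gh, gw, op, op)
            outputs[name] = patches.transpose(0, 2, 1, 3).reshape(gh * op, gw * op)
        return outputs

# Three names come from the abstract; the rest are placeholders (assumption).
variables = ["surface_temperature", "wind_speed", "geopotential_500hPa",
             "var_4", "var_5", "var_6"]
model = SharedEncoderMultiDecoder(variables, in_channels=6, patch=4, dim=32, scale=2)
coarse = rng.normal(size=(6, 16, 16))  # stand-in for a coarse GCM input
outputs = model.forward(coarse)
```

The design point the sketch illustrates is that `encode` runs once per input while only the lightweight heads are variable-specific, which is one plausible reason joint prediction can cost less per variable than running six separate models.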
Findings
Average MSE reduced by 5.5% compared to single-variable models
Achieved 29-32% lower inference time per variable
Outperformed alternative multi-variable baselines
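For readers who want to reproduce this kind of comparison on their own runs, the following is a small sketch of how such figures could be computed. It assumes the average MSE reduction is the mean of per-variable relative reductions and that per-variable inference time is total time divided by the number of variables; the summary here does not state the paper's exact definitions, and the function names are hypothetical.

```python
def relative_reduction(baseline, candidate):
    """Fractional improvement of candidate over baseline (positive means lower)."""
    return (baseline - candidate) / baseline

def mean_relative_mse_reduction(single_mse, multi_mse):
    """Average the per-variable relative MSE reductions (assumed definition)."""
    reductions = [relative_reduction(single_mse[v], multi_mse[v]) for v in single_mse]
    return sum(reductions) / len(reductions)

def per_variable_time(total_seconds, n_variables):
    """Inference time attributed to each variable when predicted jointly."""
    return total_seconds / n_variables
```

Multiplying the result of `mean_relative_mse_reduction` by 100 gives a percentage comparable in form to the figures listed above.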
Abstract
Global Climate Models (GCMs) are critical for simulating large-scale climate dynamics, but their coarse spatial resolution limits their applicability in regional studies. Regional Climate Models (RCMs) address this limitation through dynamical downscaling, albeit at considerable computational cost and with limited flexibility. Deep learning has emerged as an efficient data-driven alternative; however, most existing approaches focus on single-variable models that downscale one variable at a time. This paradigm can lead to redundant computation, limited contextual awareness, and weak cross-variable interactions. To address these limitations, we propose a multi-variable Vision Transformer (ViT) architecture with a shared encoder and variable-specific decoders (1EMD). The proposed model jointly predicts six key climate variables: surface temperature, wind speed, 500 hPa geopotential height,…
