TL;DR
This paper introduces a novel multimodal fusion framework for combining time series data and images in Earth observation, enabling cross-modal generation and improved downstream task performance.
Contribution
It presents a task-agnostic approach that aligns discrete tokens from time series and images using masked correlation learning, outperforming task-specific methods.
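The summary does not spell out the masked correlation objective, so the following is only a rough sketch of how a masked prediction setup over joint image and time series tokens could be prepared. The function name `build_masked_pair`, the `MASK_ID` and `IGNORE_INDEX` values, the vocabulary offset, and the mask ratio are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np

MASK_ID = -1         # hypothetical placeholder id for a masked position
IGNORE_INDEX = -100  # hypothetical marker for positions excluded from the loss


def build_masked_pair(image_tokens, series_tokens, image_vocab_size,
                      mask_ratio=0.5, seed=0):
    """Concatenate two token sequences and mask a random subset.

    Time series tokens are shifted by `image_vocab_size` so both modalities
    share one id space without collisions (an assumption of this sketch).
    Returns (inputs, targets), where targets hold the original ids only at
    masked positions and IGNORE_INDEX elsewhere.
    """
    rng = np.random.default_rng(seed)
    image_tokens = np.asarray(image_tokens, dtype=np.int64)
    series_tokens = np.asarray(series_tokens, dtype=np.int64) + image_vocab_size
    tokens = np.concatenate([image_tokens, series_tokens])

    # Choose which positions to hide, regardless of modality.
    n_mask = max(1, int(round(mask_ratio * tokens.size)))
    masked_pos = rng.choice(tokens.size, size=n_mask, replace=False)

    inputs = tokens.copy()
    inputs[masked_pos] = MASK_ID

    targets = np.full_like(tokens, IGNORE_INDEX)
    targets[masked_pos] = tokens[masked_pos]
    return inputs, targets


inputs, targets = build_masked_pair([3, 1, 4], [0, 2, 2], image_vocab_size=8)
```

A model trained on such pairs has to predict hidden tokens of one modality from visible tokens of either modality, which is one plausible reading of treating the two token streams as mutual predictive targets.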
Findings
Pretrained model generates consistent temperature profiles from satellite images.
Outperforms task-specific fusion by 6% in R^2 and 2% in RMSE.
Exceeds baseline methods by 50% in R^2 and 12% in RMSE.
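For readers unfamiliar with the two metrics, here is a minimal sketch of how R^2 and RMSE are commonly computed for a regression task. The function names and the toy values are illustrative; the summary does not state whether the reported percentages are relative or absolute differences, so no attempt is made to reproduce them.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: lower is better, same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 is perfect, 0 matches predicting the mean."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)


# Toy values, not data from the paper.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(rmse(y_true, y_pred), r2(y_true, y_pred))
```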
Abstract
We propose a task-agnostic framework for multimodal fusion of time series and single timestamp images, enabling cross-modal generation and robust downstream performance. Our approach explores deterministic and learned strategies for time series quantization and then leverages a masked correlation learning objective, aligning discrete image and time series tokens in a unified representation space. Instantiated in the Earth observation domain, the pretrained model generates consistent global temperature profiles from satellite imagery and is validated through counterfactual experiments. Across downstream tasks, our task-agnostic pretraining outperforms task-specific fusion by 6% in R^2 and 2% in RMSE on average, and exceeds baseline methods by 50% in R^2 and 12% in RMSE. Finally, we analyze gradient sensitivity across modalities, providing insights into model robustness. Code, data, and…
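The reviews below name Finite Scalar Quantization (FSQ) as one of the quantization strategies for time series. As a rough illustration of the general FSQ idea, and not the authors' implementation, the sketch below bounds each latent dimension with `tanh`, rounds it to a small number of levels, and combines the per-dimension codes into a single token id. It assumes the latent vectors come from some encoder that is not shown, and it restricts itself to odd level counts for simplicity; the function name `fsq_quantize` and the example level counts are hypothetical.

```python
import numpy as np


def fsq_quantize(z, levels):
    """FSQ-style quantization of latent vectors (simplified sketch).

    z:      array of shape (n, d), unbounded latent vectors.
    levels: d odd integers, the number of quantization levels per dimension.
    Returns (codes, token_ids): integer codes in [-h, h] per dimension with
    h = (level - 1) // 2, and one token id per vector in [0, prod(levels)).
    """
    z = np.asarray(z, dtype=float)
    levels = np.asarray(levels, dtype=np.int64)
    assert z.shape[-1] == levels.size, "one level count per latent dimension"
    assert np.all(levels % 2 == 1), "this sketch only handles odd level counts"

    half = (levels - 1) // 2
    bounded = half * np.tanh(z)                # squash into (-half, half)
    codes = np.rint(bounded).astype(np.int64)  # snap to the nearest integer level

    # Mixed-radix encoding: shift codes to [0, level) and combine into one id.
    digits = codes + half
    radix = np.concatenate(([1], np.cumprod(levels[:-1])))
    token_ids = digits @ radix
    return codes, token_ids


z = np.array([[0.0, 0.0], [10.0, -10.0]])
codes, token_ids = fsq_quantize(z, levels=[5, 3])
```

With levels `[5, 3]` the implicit codebook has 15 entries and needs no learned embedding table. During training, FSQ implementations typically pass gradients through the rounding step with a straight-through estimator, which this NumPy sketch omits.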
Peer Reviews
Decision·ICLR 2026 Conference Desk Rejected Submission
The central idea of the paper, discretizing both time series and images into a shared token space, is impactful. The approach establishes a task-agnostic, generative, and interpretable framework. Introducing Finite Scalar Quantization (FSQ) for time series is a good technical contribution. FSQ provides a stable, computationally efficient way to discretize long-tailed time series distributions. Treating tokens from different modalities as mutual predictive targets lead task-agnostic learning pri
The masked correlation objective lacks a formal justification or ablation contrasting it with contrastive or mutual information-based alternatives. While FSQ is good to use, the paper does not thoroughly analyze token efficiency versus representation quality. This limits understanding of scalability. The autoregressive generation may conflate spatial priors with temporal correlations. No explicit temporal grounding or causal validation is included. There is no computational efficiency analysis
- The idea of unifying time series and imagery into a shared latent space is interesting and potentially useful for Earth observation tasks.
- The technical components, such as masked correlation learning and modality alignment, are reasonable choices.
- The **motivation** for generating global temperature profiles is weak: why global temperature profiles are needed is not convincingly explained. Couldn't one simply look up the temperature from the date and location of the satellite imagery?
- The role of **quantization** is unclear; the paper does not adequately justify why time series need to be quantized for cross-modal alignment.
- The “quantizing time series” section is confusing and lacks logical flow between the different methods.
- In Se
- Fusing static images with time series is important in the satellite imagery domain, as it combines high-resolution spatial context with time series that encode evolving dynamics.
- The literature and methods around feature quantization are explored in detail.
- Code and data will be provided by the authors.
- The model is not described in detail, in particular its architecture and size, how it separates image and time series inputs, and how (or whether) position is explicitly encoded. The proposed method is a pretraining method, but it is compared against existing architectures. The baseline performance of the model in Table 2 should be added to the baselines, and results should be gathered for competing models as well to ensure a clear comparison.
- The motiva
The paper is generally well written and clear. The authors explore various methods for quantizing time series. The paper provides some interesting analyses of the method, in particular the assessment of geo-location sensitivity. The problem of fusing time series information with single-timestamp image data is interesting, and tackling it can be useful for many applications across agriculture, climate, and biodiversity.
My main concern is about the claim that this model is task-agnostic. The authors evaluate the method on four downstream tasks, but they all come from the same dataset, so it is essentially the single task of crop yield prediction. Also, the model with pretrained weights does not necessarily show a large performance gain over the model trained from scratch. Therefore, I am wondering whether the claim that the model is task-agnostic might be misleading. In the sense that it is an architecture that can be used for di
