TAMMs: Change Understanding and Forecasting in Satellite Image Time Series with Temporal-Aware Multimodal Models
Zhongbin Guo, Yuhao Wang, Ping Jian, Chengzhi Li, Xinyue Chen, Zhen Yang, Ertai E

TL;DR
TAMMs is a unified multimodal model that jointly performs change understanding and satellite image forecasting, significantly improving long-range temporal modeling in satellite image time series analysis.
Contribution
Introduces TAMMs, the first framework to jointly handle temporal change description and future satellite image forecasting, using an MLLM-diffusion architecture with temporal adaptation and semantic control.
Findings
Outperforms state-of-the-art baselines on both tasks
Enhances long-range temporal understanding in satellite data
Demonstrates effectiveness of joint modeling approach
Abstract
Temporal Change Description (TCD) and Future Satellite Image Forecasting (FSIF) are critical, yet historically disjointed tasks in Satellite Image Time Series (SITS) analysis. Both are fundamentally limited by the common challenge of modeling long-range temporal dynamics. To explore how strengthening long-range temporal understanding can improve performance on both tasks simultaneously, we introduce **TAMMs**, the first unified framework designed to jointly perform TCD and FSIF within a single MLLM-diffusion architecture. TAMMs introduces two key innovations: Temporal Adaptation Modules (**TAM**) enhance a frozen MLLM's ability to comprehend long-range dynamics, and the Semantic-Fused Control Injection (**SFCI**) mechanism translates this change understanding into fine-grained generative control. This synergistic design makes the understanding from the TCD task to…
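The abstract describes TAM only at a high level: lightweight modules that add temporal awareness while the MLLM stays frozen. As a rough illustration of that general pattern, and not the paper's actual TAM, the sketch below applies a residual bottleneck adapter to per-frame features from a frozen encoder. The names `time_embedding` and `TemporalAdapter`, the sinusoidal time encoding, the bottleneck size, and the zero-initialised up-projection are all assumptions made for this example.

```python
import numpy as np

def time_embedding(timestamps, dim):
    """Sinusoidal embedding of acquisition times (assumed encoding; dim must be even)."""
    t = np.asarray(timestamps, dtype=np.float64)[:, None]           # (T, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)   # (dim/2,)
    angles = t * freqs[None, :]                                     # (T, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1) # (T, dim)

class TemporalAdapter:
    """Hypothetical residual adapter: only w_down and w_up would be trained."""

    def __init__(self, dim, bottleneck=16, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        # Zero-initialised up-projection: the adapter starts as an identity map,
        # so the frozen features are unchanged before any training.
        self.w_up = np.zeros((bottleneck, dim))

    def __call__(self, frozen_feats, timestamps):
        # Inject time information, then pass through a small ReLU bottleneck.
        x = frozen_feats + time_embedding(timestamps, frozen_feats.shape[1])
        hidden = np.maximum(x @ self.w_down, 0.0)
        # Residual connection keeps the frozen representation as the base signal.
        return frozen_feats + hidden @ self.w_up

# Stand-in for features of 4 frames from a frozen encoder (random placeholder data).
feats = np.random.default_rng(1).normal(size=(4, 32))
adapter = TemporalAdapter(dim=32)
out = adapter(feats, timestamps=[0, 1, 2, 5])
```

With the zero-initialised up-projection, `out` equals `feats` at initialisation. This is a common way to attach an adapter to a frozen backbone without disturbing its pretrained behaviour, though whether TAMMs does the same is not stated in the abstract.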
Peer Reviews
Decision·ICLR 2026 Poster
1. Timely problem and clear unification, bridging understanding to forecasting in a single, parameter-efficient pipeline for remote-sensing time series.
2. Well-motivated architecture: TAM adds lightweight temporal conditioning without full fine-tuning; SFCI combines structural and semantic control for diffusion, yielding an intuitive, modular design.
1. On FSIF, some standard metrics are only competitive rather than strictly superior to strong generative baselines. A brief discussion and supplemental visuals would help reconcile this with the TCS gains.
2. Prompt robustness is missing: TAM relies on Contextual Temporal Prompting (CTP), but there is no analysis of robustness to prompt phrasing/length/templates; please add prompt ablations.
3. Test-time generalization is not demonstrated. Evaluation hinges on a curated set of 150 long-horizo
- Synergistic training for change detection together with future generation is an interesting idea for enhancing the reasoning capabilities of MLLMs in the satellite image domain. - Temporal Consistency Score (TCS) adds value as a metric specifically designed for application on SITS. - Performance results are strong in Table 1. - An ablation study is performed on the proposed modules.
- Some examples of the data used for training (satellite images together with generated captions) should be presented, possibly in supp. material. - writing could be improved, some non exhaustive examples for use of language include: - l.38 "because fails" - l.39-41 "the absence of a unified framework of temporal-aware multimodal model for satellite image change understanding and forecasting named TAMMs" - l.42 "How can model reason" - similarly "Satellite Image Time Series (SITS)" appe
1. The paper provides a unified framework for satellite image time-series modeling, integrating semantic understanding and future prediction into a single closed-loop process.
2. The proposed TAM and SFCI modules are logically designed to enable temporal encoding and generative control while keeping the pretrained MLLM and diffusion models frozen.
3. The introduction of the TCS metric offers a way to quantify spatial consistency between predicted and historical changes, addressing the limitation
1. The study focuses narrowly on satellite image forecasting and does not explore the applicability of the proposed framework to other domains, limiting its generality.
2. The proposed approach appears primarily as an engineering implementation based on pretrained LLMs and diffusion models, with limited introduction of new theoretical mechanisms or modeling principles; thus, its novelty is modest.
3. The overall architecture is complex but lacks a clear explanation of the core ideas and hierarch
