Spatial-Temporal-Spectral Unified Modeling for Remote Sensing Dense Prediction
Sijie Zhao, Feng Liu, Enzhuo Zhang, Yiqing Guo, Pengfeng Xiao, Lei Bai, Xueliang Zhang, Hao Chen

TL;DR
This paper introduces STSUN, a flexible deep learning model that unifies multiple remote sensing dense prediction tasks across diverse data configurations, achieving state-of-the-art results.
Contribution
The paper presents a unified network architecture that adapts to inputs with arbitrary spatial, temporal, and spectral dimensions, and unifies multiple dense prediction tasks with flexible handling of semantic classes.
Findings
STSUN effectively adapts to heterogeneous input-output configurations.
It unifies multiple dense prediction tasks within a single model.
The approach achieves state-of-the-art performance across various datasets.
Abstract
The proliferation of multi-source remote sensing data has propelled the development of deep learning for dense prediction, yet significant challenges in data and task unification persist. Current deep learning architectures for remote sensing are fundamentally rigid. They are engineered for fixed input-output configurations, restricting their adaptability to the heterogeneous spatial, temporal, and spectral dimensions inherent in real-world data. Furthermore, these models neglect the intrinsic correlations among semantic segmentation, binary change detection, and semantic change detection, necessitating the development of distinct models or task-specific decoders. This paradigm is also constrained to a predefined set of output semantic classes, where any change to the classes requires costly retraining. To overcome these limitations, we introduce the Spatial-Temporal-Spectral Unified…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
I acknowledge that this review has been produced considering the revised manuscript.
1. The framework, which incorporates metadata of data inputs and dense prediction tasks, is well introduced and carefully distinguishes the spatial, temporal, and spectral input and output dimensions.
2. The authors leveraged metadata to generate an adapted architecture for input and output dimensions, proposing five modules to unify their representations.
3. The learnable task embedding effectively guides th
I acknowledge that this review has been produced considering the revised manuscript.
1. The limitations that this work addresses (L.90-103) are partially studied and grounded; however, several nuances should be considered, which weaken the motivations:
a. Fixed configurations: ViTs are capable of processing sequences of arbitrary length; if spatial-temporal-spectral cubes are divided into tokens, this is no longer a limitation in theory [1, 2, 3, 4, 5].
b. Fixed task: I would like to quot
1. The paper is clearly written and easy to follow, with a well-structured presentation of the problem and proposed approach.
2. The idea of unifying spatial, temporal, and spectral dimensions within a single framework is interesting and relevant to challenges in remote sensing dense prediction.
1. The unification of spatial, temporal, and spectral dimensions has already been explored in several recent remote sensing foundation models, such as RingMo-Agent [1] and Falcon [2], which aim to build unified representations across multi-platform and multi-modal data. The paper does not discuss or compare its approach with these existing large-scale models, limiting the clarity of its novelty and positioning.
[1] RingMo-Agent: A Unified Remote Sensing Foundation Model for Multi-Platform and Mu
The paper addresses an important and practical issue in remote sensing: heterogeneity of input and output structures. Encoding spatial, temporal, and spectral configurations as metadata is a smart, scalable idea. The local-global attention design helps handle multi-resolution dependencies, and multi-task training improves performance. Experiments are extensive, and results are strong across benchmarks.
1. Despite the strong empirical results, the core components are based on existing ideas. The DUM is a direct application of a transformer-based hypernetwork, and the LGWA is conceptually similar to other multi-scale windowed attention mechanisms found in models like the Swin Transformer or SegFormer.
2. The paper lacks theoretical analysis or deeper insight into its decoupled unification strategy, relying solely on experimental ablation (Table 11) to justify its design.
3. Although the appendix
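One reviewer's remark above, that tokenizing spatial-temporal-spectral cubes turns differing input configurations into sequences of differing length, can be illustrated with a minimal sketch. This is not the STSUN implementation: the helper name `cube_to_tokens`, the `(T, C, H, W)` layout, and the patch size are assumptions chosen for illustration only.

```python
import numpy as np


def cube_to_tokens(cube: np.ndarray, patch: int = 4) -> np.ndarray:
    """Split a (T, C, H, W) cube into one token per (time step, band, spatial patch).

    Hypothetical helper for illustration; not taken from the paper.
    """
    t, c, h, w = cube.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")
    # (T, C, H, W) -> (T, C, H/p, p, W/p, p) -> (T, C, H/p, W/p, p, p)
    x = cube.reshape(t, c, h // patch, patch, w // patch, patch)
    x = x.transpose(0, 1, 2, 4, 3, 5)
    # Flatten to (num_tokens, token_dim); token_dim is p * p for every input
    return x.reshape(-1, patch * patch)


# Two inputs that differ in temporal, spectral, and spatial size
single_rgb = np.zeros((1, 3, 16, 16), dtype=np.float32)
bitemporal_ms = np.zeros((2, 10, 32, 32), dtype=np.float32)

print(cube_to_tokens(single_rgb).shape)     # (48, 16)
print(cube_to_tokens(bitemporal_ms).shape)  # (1280, 16)
```

Both inputs yield tokens of the same width, so only the sequence length changes with the input configuration. Whether this is sufficient in practice, as opposed to "in theory", is the point under debate in the review.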
Taxonomy
Topics: Remote-Sensing Image Classification · Domain Adaptation and Few-Shot Learning · Remote Sensing in Agriculture
Methods: Softmax · Attention Is All You Need
