Physics-Informed Spectral Modeling for Hyperspectral Imaging
Zuzanna Gawrysiak, Krzysztof Krawiec

TL;DR
PhISM is a physics-informed deep learning model that effectively disentangles hyperspectral data, requiring minimal supervision, and offers interpretable insights, outperforming previous methods on various benchmarks.
Contribution
Introduces PhISM, a novel physics-informed spectral modeling approach that learns without supervision and enhances interpretability in hyperspectral imaging.
Findings
Outperforms prior methods on classification and regression benchmarks.
Requires limited labeled data for training.
Provides interpretable latent representations.
Abstract
We present PhISM, a physics-informed deep learning architecture that learns without supervision to explicitly disentangle hyperspectral observations and model them with continuous basis functions. PhISM outperforms prior methods on several classification and regression benchmarks, requires limited labeled data, and provides additional insights thanks to its interpretable latent representation.
Table I: Overall accuracy (OA) and average accuracy (AA), in %, on the classification benchmarks.
| Method | Salinas Valley OA | Salinas Valley AA | Pavia University OA | Pavia University AA | Indian Pines OA | Indian Pines AA |
|---|---|---|---|---|---|---|
| 3D [8] | 69.7 | 69.1 | 70.1 | 60.2 | 48.9 | 38.3 |
| 1D [28] | 64.2 | 64.7 | 73.3 | 62.1 | 67.1 | 55.1 |
| BAAS [31] | 73.4 | 74.3 | 69.5 | 60.4 | 46.8 | 35.4 |
| SF [15] | 68.1 ± 3.2 | 67.7 ± 2.4 | 69.9 ± 1.2 | 59.1 ± 1.6 | 48.6 ± 0.6 | 39.5 ± 1.1 |
| 3DAES [17] | 73.1 ± 2.8 | 77.8 ± 2.2 | 68.5 ± 1.5 | 69.2 ± 1.7 | 63.7 ± 0.5 | 53.1 ± 1.0 |
| Autoencoder | 71.4 ± 3.8 | 76.2 ± 2.3 | 66.1 ± 1.8 | 66.5 ± 1.1 | 59.8 ± 0.7 | 50.4 ± 0.8 |
| PhISM (ours) | 73.4 ± 3.8 | 78.3 ± 2.5 | 67.4 ± 1.9 | 68.0 ± 1.2 | 64.4 ± 0.4 | 54.6 ± 0.8 |
| PhISM (fixed) | 70.3 ± 4.0 | 75.1 ± 2.7 | 66.1 ± 1.9 | 66.6 ± 1.3 | 57.7 ± 0.5 | 48.8 ± 0.9 |
Table II: Average accuracy (AA), in %, on Pavia University for decreasing shares of labeled training data.
| Method | 50% | 10% | 5% | 1% | 0.5% |
|---|---|---|---|---|---|
| 3DAES [17] | 84.7 ± 0.3 | 83.9 ± 0.6 | 83.2 ± 0.7 | 75.4 ± 1.1 | 69.1 ± 1.3 |
| Autoencoder | 79.9 ± 0.2 | 77.8 ± 0.6 | 76.6 ± 0.6 | 67.5 ± 2.9 | 65.4 ± 1.5 |
| Raw | 81.3 ± 0.4 | 80.8 ± 0.5 | 78.6 ± 0.8 | 72.9 ± 1.9 | 56.1 ± 2.5 |
| PhISM (ours) | 82.5 ± 0.3 | 80.1 ± 0.7 | 79.2 ± 0.8 | 73.6 ± 1.6 | 70.0 ± 3.0 |
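Table III: Hyperview score on the regression benchmarks H1 and H2 (an error score, so lower is better).
| Dataset | Raw | Autoencoder | PhISM (ours) | PhISM (fixed) |
|---|---|---|---|---|
| H1 | 0.723 ± 0.066 | 0.732 ± 0.069 | 0.721 ± 0.064 | 0.798 ± 0.091 |
| H2 | 0.501 ± 0.098 | 0.493 ± 0.091 | 0.389 ± 0.095 | 0.484 ± 0.081 |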
Zuzanna Gawrysiak and Krzysztof Krawiec
Z. Gawrysiak and K. Krawiec are with the Institute of Computing Science, Poznan University of Technology, Poznan, Poland (e-mail: {zgawrysiak, kkrawiec}@cs.put.poznan.pl).
Index Terms:
Hyperspectral imaging, Self-supervised learning, Representation learning, Explainable AI.
I Introduction
Hyperspectral remote sensing (RS) captures a high-resolution spectral signature, a physically grounded pattern that describes how materials reflect or absorb light across wavelengths, and thereby enables fine-grained discrimination that far exceeds the capabilities of conventional imaging. However, the high-dimensional feature space challenges machine learning (ML) methods: models require more parameters, become data-hungry, and are prone to overfitting, especially when labels are scarce. Conventional deep learning (DL) models, such as Convolutional Neural Networks (CNNs), treat input channels as independent features and ignore the physical correlations between neighboring spectral bands. This forces the model to rediscover well-established knowledge from data alone and increases the risk of forming physically implausible hypotheses [10, 25].
To address these challenges, we propose Physically Informed Spectral Modeler (PhISM), an architecture that incorporates domain knowledge by representing spectral components with transparent latent basis functions, each controlled by a small number of interpretable parameters. PhISM achieves performance competitive with state-of-the-art techniques, works robustly with limited training data, and, by operating in the realm of spectral features that are familiar to geoscientists, is more interpretable than DL methods.
II The approach
PhISM is based on the autoencoder blueprint and involves two stages (Fig. 1): (i) autoassociative, self-supervised, and task-agnostic training of the autoencoder, which forms informative latent representations that allow the input image to be reconstructed as accurately as possible (Section II-A), and (ii) task-specific training of a prediction module that maps the latent representation to the respective dependent variables (Section II-B).
II-A Self-supervised modeling and reconstruction of spectra
PhISM’s autoencoder comprises an encoder and a decoder that communicate via a compact latent representation (Fig. 1). The model processes each image pixel independently and in parallel, which ensures that the learned estimates rely exclusively on the physical spectral signature and avoids both confounding spectral signals with spatial textures and information leakage from the neighborhood. The encoder is a lightweight, pixel-wise CNN feature extractor comprising five 1 × 1 convolutional layers with 512, 1024, 512, 256, and 4 channels, respectively (1.2M total parameters), each followed by batch normalization and Leaky ReLU activation. The input dimension matches the number of spectral bands. The decoder, rather than relying on typical DL components, explicitly parametrizes continuous spectral components represented with basis functions, which together form the reconstructed spectrum. The decoder thus ‘expresses’ the spectral components parameterized by the encoder; since this comes down to sampling the basis functions at specific wavelengths, we refer to it as the renderer.
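As a rough sanity check on the stated encoder size, the parameter count of the five 1 × 1 convolutions can be tallied by hand. The sketch below assumes 103 input bands (the Pavia University setting), a bias term in every convolution, and two learnable parameters per batch-normalization channel; none of these details is confirmed by the text, and `count_encoder_params` is a hypothetical helper introduced only for illustration.

```python
def count_encoder_params(n_bands, channels=(512, 1024, 512, 256, 4)):
    """Count the weights of a stack of 1x1 convolutions, each followed by batch norm.

    Assumptions (not stated in the paper): every convolution has a bias,
    and batch norm contributes a learnable scale and shift per channel.
    """
    total = 0
    in_ch = n_bands
    for out_ch in channels:
        total += in_ch * out_ch + out_ch  # 1x1 convolution: weights + biases
        total += 2 * out_ch               # batch norm: scale + shift
        in_ch = out_ch
    return total


if __name__ == "__main__":
    # Only the first layer depends on the sensor (number of input bands).
    print(count_encoder_params(103))
```

Under these assumptions the count for 103 bands comes out at roughly 1.24M, which is in line with the 1.2M figure quoted for the encoder.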
For the model to be trainable end-to-end with gradient-based optimization, the basis functions need to be differentiable, which holds for, e.g., splines, polynomials, normal distributions, beta distributions, and skew normal distributions. Here, we use the skew normal distribution, as it fared best in preliminary experiments on the HYPERVIEW benchmark (Sec. IV) in terms of reconstruction quality measured by PSNR (splines: 40.5, polynomials: 38.3, normal: 38.8, beta: 26.5, skew normal: 48.5). Its probability density function (PDF) is parametrized by the mean ($\mu$), standard deviation ($\sigma$), and skew ($\alpha$):
$$f(\lambda; \mu, \sigma, \alpha) = 2\,\phi(\lambda; \mu, \sigma)\,\Phi\!\left(\alpha\,\frac{\lambda - \mu}{\sigma}\right), \tag{1}$$
where $\lambda$ is the wavelength, $\phi$ is the PDF of the normal distribution and $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution, $\Phi(x) = \frac{1}{2}\left[1 + \operatorname{erf}\left(x/\sqrt{2}\right)\right]$. The total estimate at wavelength $\lambda$ is:
$$\hat{s}(\lambda) = \sum_{k=1}^{K} a_k\, f(\lambda; \mu_k, \sigma_k, \alpha_k), \tag{2}$$
where $\mu_k$, $\sigma_k$, and $\alpha_k$ parameterize the $k$th spectral component, and the scale $a_k$ modulates its contribution to the spectrum. To calculate the per-pixel output, the renderer simply queries $\hat{s}$ at the wavelengths corresponding to the input bands.
There are thus four parameters per spectral component ($\mu_k$, $\sigma_k$, $\alpha_k$, and $a_k$), which requires $4K$ dimensions in the latent, where $K$ is usually moderate. We use the sigmoid activation function for $\mu_k$ and $\sigma_k$ to ensure non-negativity, and the $\tanh$ activation function for $\alpha_k$ and $a_k$. Then, the outputs of the activation functions are multiplied by the number of spectral channels (e.g. 224 for the AVIRIS sensor) and fed into Eq. (2). Notice that the signed $a_k$ allows the components to contribute positively or negatively. The model works with spectra that are zero-centered w.r.t. the means calculated from the training set, i.e. Eq. (2) estimates the signed divergence from them.
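To make the renderer concrete, the following is a minimal sketch of how a spectrum could be assembled from skew normal components as described around Eq. (2). It is not the authors' implementation: it uses only the Python standard library, queries the components at integer band indices (an assumption that follows from the scaling by the number of channels), and the names `skew_normal_pdf` and `render_spectrum` are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    """PDF of the normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def std_normal_cdf(x):
    """CDF of the standard normal distribution, expressed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def skew_normal_pdf(x, mu, sigma, alpha):
    """Skew normal PDF: twice the normal PDF times the CDF of the skewed argument."""
    return 2.0 * normal_pdf(x, mu, sigma) * std_normal_cdf(alpha * (x - mu) / sigma)


def render_spectrum(components, n_bands):
    """Sum scaled skew normal components at every band index.

    components: list of (mu, sigma, alpha, scale) tuples in band-index units
    (assumed here); a negative scale yields a negative contribution.
    """
    return [
        sum(a * skew_normal_pdf(b, mu, sigma, alpha) for mu, sigma, alpha, a in components)
        for b in range(n_bands)
    ]


if __name__ == "__main__":
    # Two illustrative components: one positive, one negative (values are made up).
    demo = [(30.0, 10.0, 2.0, 5.0), (70.0, 8.0, -1.0, -3.0)]
    spectrum = render_spectrum(demo, n_bands=103)
    print(len(spectrum), min(spectrum), max(spectrum))
```

Because the renderer is a closed-form expression with no weights of its own, every degree of freedom in the reconstruction lives in the four numbers per component, which is what makes the latent directly readable.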
Training follows the standard autoencoder blueprint: the encoder produces the latent vector, the decoder uses it to render and combine the spectral components, and the resulting spectrum is compared to the input spectrum with the Huber loss function [16], which combines the advantages of MSE and MAE. The AdamW optimizer [23] updates the encoder’s parameters (the renderer has no trainable parameters) with a learning rate of 0.0001 for 50 epochs or until the validation loss has not improved for 5 epochs (early stopping). The training process is entirely self-supervised and thus does not require ground-truth data, which is often scarce and hard to come by.
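The way the Huber loss blends MSE and MAE can be illustrated with a short sketch. The threshold `delta = 1.0` and the averaging over bands are assumptions made here for illustration; the text does not state which settings were used.

```python
def huber(residual, delta=1.0):
    """Huber penalty: quadratic for small residuals, linear for large ones."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r            # MSE-like regime
    return delta * (r - 0.5 * delta)  # MAE-like regime


def huber_loss(predicted, target, delta=1.0):
    """Mean Huber penalty over the bands of one spectrum."""
    return sum(huber(p - t, delta) for p, t in zip(predicted, target)) / len(target)


if __name__ == "__main__":
    # A small and a large reconstruction error in a two-band toy spectrum.
    print(huber_loss([0.0, 0.0], [0.5, 3.0]))
```

Small reconstruction errors are penalized quadratically, while outlying bands contribute only linearly, so a few noisy channels do not dominate the gradient.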
To select the number of components $K$, we performed a sensitivity analysis by measuring the PSNR for a range of $K$ values and observed that it increases from 42.1 for the smallest $K$ considered to 48.5 at $K = 5$, and then plateaus. Consequently, we use $K = 5$ to maximize interpretability without compromising fidelity.
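For readers who want to reproduce this kind of fidelity comparison, a minimal PSNR sketch is given below. The choice of peak value (here the largest absolute value of the reference spectrum unless given explicitly) is an assumption, since the text does not say how the peak was defined.

```python
import math


def psnr(reference, reconstruction, peak=None):
    """Peak signal-to-noise ratio in dB between two equally long sequences."""
    mse = sum((r - q) ** 2 for r, q in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf  # perfect reconstruction
    if peak is None:
        peak = max(abs(v) for v in reference)  # assumed peak definition
    return 10.0 * math.log10(peak * peak / mse)


if __name__ == "__main__":
    print(psnr([0.0, 1.0, 0.0, 1.0], [0.1, 0.9, 0.1, 0.9]))
```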
Though the composition of separately modeled spectral components bears resemblance to spectral unmixing [19], PhISM significantly diverges from it by (i) not relying on predefined spectral components, but learning them from data, and (ii) modeling them with smooth basis functions, to match the characteristics and variability of spectral patterns, while keeping their complexity at bay. Rather than aiming at maximally faithful modeling of physical processes, we aim at a degree of physical plausibility that both constrains and informs our models, so that they generalize well.
II-B Supervised learning for prediction of dependent variables
Once the autoencoder has been trained, we discard the renderer (decoder) and use the compact interpretable latent features for predictive downstream regression and classification tasks. We achieve this by appending an arbitrary ML model to the encoder and training it in a supervised fashion on the available labeled ground-truth data (bottom part of Fig. 1). Because the number of latent features is low, well-performing predictive models can be trained even from very small samples of labeled pixels (Sec. IV). Also, one can opt for a transparent ML model (e.g., a decision tree) to improve the overall interpretability.
III Related work
Incorporating domain-specific knowledge [6] and physical principles [18] into ML models bridges data-driven and physically-grounded approaches, enabling better generalization and interpretability. The efficacy of enforcing physical constraints has been well-established in many engineering domains, e.g. thermal [22] and electrochemical modeling [34]. In RS, a range of works attempted to inject the relevant priors into DL models explicitly, e.g. by engaging predefined ontologies [20]. Zheng et al. [37] combined spectral unmixing with deep learning to enhance image fusion and generate high-resolution hyperspectral images from high-resolution multispectral and low-resolution hyperspectral inputs. Unsupervised dehazing networks augmented with hybrid priors have shown promising results in improving the quality of hyperspectral images [13].
Physics-inspired approaches also provide robust solutions for unsupervised super-resolution of hyperspectral data, as demonstrated by the physics-driven autoencoder presented in [21]. Camps-Valls et al. [3] integrated physics-driven insights to address geoscience-specific challenges in RS. CRANN [35] used physics-based principles combined with neural networks to retrieve cloud properties from hyperspectral measurements. Li et al. [36] leveraged spatial autocorrelation to explicitly account for spatial relationships, enabling improved detection of terrain features under weak supervision. VarioCNN [14] combined physically constrained neural networks with deep CNNs to analyze complex glaciological processes (crevasse classification). GASlumNet [24] integrated DL with geoscientific prior knowledge to improve slum mapping accuracy. Ge et al. [10] outlined the Geoscience-Aware DL paradigm that integrates geoscience knowledge into DL frameworks at various stages of modeling. These methods leverage domain physics extrinsically, e.g. via simulated training data [35] or spatial statistics [14]. In contrast, PhISM embeds physics intrinsically, by building the spectrum from continuous basis functions (Eq. (2)).
Given that RS-specific domain knowledge can often be represented in symbolic form, a number of works can be seen as subscribing to the paradigm of neurosymbolic AI [9, 32]. Harmon et al. [12] used probabilistic soft logic rules to encode expert insights into a neuro-symbolic model, improving tree crown delineation and enabling generalization beyond annotated data. Incorporating domain knowledge in the form of equations embedded in the loss function proved particularly effective in the classification of tree species, while also enhancing explainability [11]. Chen et al. [5] discussed implications for mineral prediction, underscoring the synergy between symbolic reasoning and neural methods. Potnis et al. [30] integrated geospatial knowledge graphs into DL models to enhance neurosymbolic AI for RS scene understanding.
PhISM’s novelty in relation to past work consists in explicit modeling of spectral components using continuous, differentiable formulas, which facilitates self-supervised training from small data and is more interpretable than DL approaches.
IV Results
We demonstrate PhISM on a number of classification and regression benchmarks, following the procedure outlined in Sec. II: we fit the autoencoder to the training set (Sec. II-A) and combine it with a predictive ML model, which we train to map the encoder’s latent to the dependent variable (Sec. II-B). The data is first zero-centered by subtracting from each spectral band its average calculated from the training set.
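The zero-centering step can be sketched as follows. The important detail is that the per-band means come from the training set only and are then reused for test pixels; the function names `band_means` and `zero_center` are hypothetical.

```python
def band_means(train_pixels):
    """Per-band averages over training pixels (each pixel is a list of band values)."""
    n = len(train_pixels)
    return [sum(band) / n for band in zip(*train_pixels)]


def zero_center(pixels, means):
    """Subtract the training-set band means from every pixel."""
    return [[v - m for v, m in zip(pixel, means)] for pixel in pixels]


if __name__ == "__main__":
    train = [[1.0, 10.0], [3.0, 30.0]]
    means = band_means(train)
    # Test pixels are centered with the training means, never with their own.
    print(zero_center([[2.0, 25.0]], means))
```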
The method has been implemented in PyTorch. A typical cross-validation experiment took 8 minutes for a single classification benchmark and 30 minutes for a single regression benchmark on an NVIDIA A100 GPU with 80 GB of VRAM. Technical details can be found in the source code repository (https://github.com/zuzg/domain-aware-hyperspectral-ml).
IV-A Results for classification tasks
We use the modernized versions of three popular pixel classification benchmarks: Salinas Valley (SV), an agricultural area captured with the AVIRIS sensor (https://aviris.jpl.nasa.gov), 224 bands, 16 classes; Pavia University (PU), an urban area captured with the ROSIS sensor, 103 bands, 9 classes; and Indian Pines (IP), a mixed agricultural/forest area, AVIRIS sensor, 200 bands, 16 classes. To avoid information leakage and provide a fair and reproducible comparison, we use the fixed partitioning of data into a training part (spatially disjoint patches) and a testing part (all remaining pixels) proposed in [28]; random partitioning of pixels into training and test sets leads to information leaks and overly optimistic accuracy estimates, up to 100% [7]. In each of 4 (IP) or 5 (SV, PU) cross-validation folds, we first train our autoencoder with $K = 5$ spectral components (Sec. II-A), resulting in a 20-dimensional latent representation. These features are then used to train a pixel-wise XGBoost classifier [4] (Sec. II-B). We report the overall accuracy (OA), i.e. the ratio of correctly predicted pixels to all test pixels, and the average accuracy (AA), i.e. the mean of per-class accuracies, to address the class imbalance. We repeated the training and testing in each fold 5 times with different seeds, so the presented results summarize 20 (IP) or 25 (SV, PU) runs of the method.
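The difference between the two reported metrics is easiest to see on an imbalanced toy example. The sketch below implements OA and AA exactly as defined above, using only the standard library; the function names are illustrative.

```python
from collections import defaultdict


def overall_accuracy(y_true, y_pred):
    """Share of correctly predicted pixels among all test pixels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def average_accuracy(y_true, y_pred):
    """Mean of per-class accuracies, so that small classes weigh as much as large ones."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        counts[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / counts[c] for c in counts) / len(counts)


if __name__ == "__main__":
    # A classifier that always predicts the majority class (8 of 10 pixels).
    y_true = [0] * 8 + [1] * 2
    y_pred = [0] * 10
    print(overall_accuracy(y_true, y_pred), average_accuracy(y_true, y_pred))
```

In this toy case the majority-class predictor scores well on OA while AA exposes that one class is never recognized, which is why AA is the more informative metric when class sizes differ strongly.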
IV-A1 Results
In Table I, we compare PhISM against six methods: 1D CNN [28], operating on per-pixel spectra; 3D CNN [8], processing small spatial-spectral cubes; Band-Adaptive Spectral-Spatial Feature Learning (BAAS) [31]; SpectralFormer (SF) [15], a Transformer-based architecture; 3DAES [17], an autoencoder-based architecture; and a conventional DL autoencoder. The latter comprises the same encoder architecture as PhISM’s and a 1×1-convolutional decoder that ‘mirrors’ the encoder (thus doubling the number of parameters relative to PhISM); XGBoost learns from the latent representation of this model. To ensure a fair comparison, the autoencoder was tuned using Optuna [1].
Despite not being optimized specifically for segmentation, PhISM achieves the best AA on SV (78.3) and is competitive on PU and IP, where it trails the best method by about one percentage point or less (68.0 vs. 69.2 for 3DAES on PU; 54.6 vs. 55.1 for the 1D CNN on IP). This confirms the generality and strong discriminative capacity of the learned representations. On OA, PhISM yields to other methods on PU and IP; however, this metric largely neglects the smaller decision classes, which is particularly undesirable for the considered benchmarks, where the number of pixels per decision class can vary by more than an order of magnitude.
In the ablated PhISM (fixed) variant, the means $\mu_k$, standard deviations $\sigma_k$, and skews $\alpha_k$ are optimized in training, but do not depend on the observed input spectrum (like biases in DL units). These models form fixed spectral components that are mixed linearly with the input-dependent scales $a_k$, akin to spectral unmixing (cf. Sec. II). The significantly worse performance of this variant corroborates the need for pixel-wise shaping of spectral components.
IV-A2 Visualization of components
Figure 2 presents the spectral components produced by one of the models trained on PU for three testing pixels selected randomly from the largest decision classes: asphalt, meadows, and bare soil. Curve color corresponds to the component index ($k$ in Eq. (2)). In contrast to spectral unmixing, which controls only the weights of spectral components, PhISM also modulates their shapes and can model both positive and negative contributions, which in principle allows capturing, respectively, emission and absorption at particular wavelengths.
IV-A3 Interpretability
The explicit representation of components eases interpretation of inference conducted by PhISM. For instance, the parameters of the components shown for the example pixels in Fig. 2 reveal that consecutive components tend to focus on increasing wavelengths, with lower-index components operating around the green hue and higher-index ones covering infrared wavelengths. Further insights can be obtained by, e.g., inspecting attribute importance using the Shapley interaction values [26].
Each encoder instance starts training from a random initial configuration of parameters and may therefore, in principle, converge to different spectral components. To assess the replicability of this process, we trained and evaluated 10 models and examined the distributions of the four predicted parameters of the skew normal functions. The median of per-image standard deviations was below 0.04 for $\mu$ and $\sigma$ ($[0, 1]$ range), and below 0.09 for $\alpha$ and $a$ ($[-1, 1]$ range). This stability shows that the representation biases imposed by the skew normal functions and the low-dimensional latent space regularize the model effectively.
IV-A4 Emergence of structure in the latent
Figure 3 presents the 2D projection of PhISM’s latent space, obtained by applying the t-SNE method [33] to the 20 parameters that control the spectral components in the model trained on the PU dataset. Clusters of observations that represent materials of similar constitution (e.g., Bitumen and Asphalt, Meadows and Bare soil) tend to overlap, which suggests that self-supervision was sufficient to adequately capture their spectral similarity. Conversely, classes that have little in common (e.g., Asphalt and Meadows) are clearly separated. Some classes (Metal sheets, Shadows) form compact, isolated clusters, which in principle allows delineating them without explicit labeling of pixels (i.e., labeling them post-hoc).
IV-A5 Learning from small data
To simulate label-scarce conditions, we trained independent XGBoost models on small subsets (0.5-50%) of the PU training set processed with the same encoder architectures as in Table I, but trained with , and queried them on the fixed set of the remaining 50% of pixels. The values of AA obtained by repeating this process 10 times for different random seeds, reported in Table II, are higher than in Table I, because the partitioning of pixels into train and test sets is here random. Crucially however, PhISM fares systematically better than for Raw and Autoencoder and degrades more gracefully when labeled training data become gradually more scarce. While the self-supervised 3DAES [17] is less impacted by moderate deprivation of labeled data (10-5%), PhISM maintains comparable performance in the extreme low-data regime (0.5%). This stability suggests that PhISM’s physics-informed constraints act as a regularizer.
IV-B Results for regression tasks
We apply PhISM to the regression tasks posed in the HYPERVIEW challenge [27] (H1, data acquired with the HySpex VS-725 sensor) and the HYPERVIEW 2 challenge (H2, data from PRISMA, https://directory.eoportal.org/satellite-missions/prisma-hyperspectral; see https://platform.ai4eo.eu/hyperview2). For H1, the soil parameters to be predicted are K, P, Mg, and pH level; for H2, these are B, Cu, Zn, Fe, S, and Mn. In contrast to the above classification tasks, the dependent variables in H1 and H2 are given per image patch, rather than per pixel. We use only the publicly available parts from both challenges, for which the values of the dependent variables are available. For H1, these are 1,732 patches, which we divide into 1,000 training samples, 124 for validation, and 608 for testing; the average patch size is 60 × 60 pixels with 150 hyperspectral bands. For H2, there are 1,876 patches, which we divide into 1,000 training samples, 124 for validation, and 752 for testing; the average patch size is 2 × 2 pixels (60 × 60 meters) with 230 hyperspectral bands.
The self-supervised phase of training remains the same as in classification, i.e. the model learns to reproduce the spectrum in each pixel, with the number of spectral components $K$ set to 5. We then average the latent representations per patch and train a separate Random Forest [2] regressor on them for each of the dependent variables.
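The per-patch aggregation amounts to a dimension-wise mean over the latent vectors of the pixels in a patch. A minimal sketch, with the hypothetical helper `patch_latent`, could look like this:

```python
def patch_latent(pixel_latents):
    """Average per-pixel latent vectors of one patch into a single feature vector."""
    n = len(pixel_latents)
    return [sum(dim) / n for dim in zip(*pixel_latents)]


if __name__ == "__main__":
    # Two pixels with a 2-dimensional latent each (toy values).
    print(patch_latent([[1.0, 2.0], [3.0, 6.0]]))
```

The resulting vector has the same length as the per-pixel latent regardless of patch size, so patches of different shapes can be fed to the same regressor.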
Table III compares the performance of PhISM to the baselines in terms of the error score used in the challenges (Hyperview score [27]), which aggregates the errors committed on all dependent variables relative to fixed baselines; lower values are better. The baselines are a simple Autoencoder (as in the classification tasks) and the Raw configuration, in which the Random Forest learns directly from the spectral channels averaged over a patch. PhISM slightly outperforms both baselines on H1 (0.721 vs. 0.723 and 0.732); on H2, its advantage is much more evident (0.389 vs. 0.501 and 0.493). The fixed variant fares worse again, confirming the usefulness of the pixel-dependent prediction of all parameters of PhISM’s spectral components.
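A score of this kind can be sketched as below. The exact formula is an assumption on our part: the sketch takes the mean, over dependent variables, of the model's MSE divided by the MSE of a baseline that predicts one constant value per variable, which is our reading of the challenge description rather than something spelled out in the text. The name `relative_error_score` and the toy values are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_error_score(targets, predictions, baseline_values):
    """Mean over variables of model MSE divided by constant-baseline MSE.

    targets, predictions: dict mapping variable name -> list of values.
    baseline_values: dict mapping variable name -> constant baseline prediction
    (assumed here to be, e.g., the training-set mean of that variable).
    """
    ratios = []
    for name, y_true in targets.items():
        model_mse = mse(y_true, predictions[name])
        base_mse = mse(y_true, [baseline_values[name]] * len(y_true))
        ratios.append(model_mse / base_mse)
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    targets = {"K": [1.0, 3.0], "pH": [6.0, 8.0]}
    baseline = {"K": 2.0, "pH": 7.0}
    perfect_k_only = {"K": [1.0, 3.0], "pH": [7.0, 7.0]}
    print(relative_error_score(targets, perfect_k_only, baseline))
```

Under this reading, a score of 1 means the model is no better than the constant baseline and 0 means a perfect fit, which is consistent with the values below 1 reported in Table III.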
V Conclusion
We have shown that equipping DL models with physics-inspired priors informs them effectively and offers better predictive accuracy, lower demand for labeled data, and more transparency of the inference process. Overall, the neurosymbolic architectures [9, 32] offer a particularly promising and natural framework for incorporating the wealth of RS-related domain knowledge, and will continue to be the subject of our further research. Among others, we plan to exploit PhISM’s use of continuous physical parameters (e.g., wavelength in nm), rather than discrete band indices, as it facilitates cross-sensor transferability: unlike in standard CNNs, the learned latent representation is sensor-agnostic. Future work will transfer models between sensors by mapping diverse spectral samplings to this unified physical space.
Acknowledgment: Research supported by the statutory funds of Poznan University of Technology and the Polish Ministry of Science and Higher Education, grant no. 2025/57/B/ST6/03737.
References
- [1] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama (2019) Optuna: A Next-generation Hyperparameter Optimization Framework. In Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining.
- [2] L. Breiman (2001) Random forests. Mach. Learn. 45, pp. 5–32.
- [3] G. Camps-Valls, D. H. Svendsen, J. Cortés-Andrés, Á. Moreno-Martínez, A. Pérez-Suay, J. Adsuara, I. Martín, M. Piles, J. Muñoz-Marí, and L. Martino (2021) Physics-Aware Machine Learning for Geosciences and Remote Sensing. In Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), pp. 2086–2089.
- [4] T. Chen and C. Guestrin (2016) XGBoost: A Scalable Tree Boosting System. In Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, pp. 785–794.
- [5] W. Chen, X. Ma, Z. Wang, W. Li, C. Fan, J. Zhang, X. Que, and C. Li (2024) Exploring neuro-symbolic AI applications in geoscience: implications and future directions for mineral prediction. Earth Sci. Inform. 17 (3), pp. 1819–1835.
- [6] T. Dash, S. Chitlangia, A. Ahuja, and A. Srinivasan (2022) A review of some techniques for inclusion of domain-knowledge into deep neural networks. Sci. Rep. 12.
- [7] H. Feng, Y. Wang, Z. Li, N. Zhang, Y. Zhang, and Y. Gao (2023) Information Leakage in Deep Learning-Based Hyperspectral Image Classification: A Survey. Remote Sens. 15 (15).
- [8] Q. Gao, S. Lim, and X. Jia (2018) Hyperspectral Image Classification Using Convolutional Neural Networks and Multiple Feature Learning. Remote Sens. 10 (2).
