Intrinsic Explainability of Multimodal Learning for Crop Yield Prediction
Hiba Najjar, Deepak Pathak, Marlon Nuske, Andreas Dengel

TL;DR
This paper explores the intrinsic explainability of Transformer models in multimodal crop yield prediction, demonstrating their superior performance and providing insights into feature and modality contributions using novel attribution methods.
Contribution
It introduces the use of Transformer-based models for interpretable multimodal crop yield prediction and proposes the Weighted Modality Activation method for modality attribution.
Findings
Transformers outperform CNNs and RNNs in yield prediction accuracy.
Attention Rollout provides more reliable temporal attributions.
Modality attributions vary across methods and are interpretable with agronomic knowledge.
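For readers unfamiliar with Attention Rollout, the following is a minimal sketch of the technique as it is commonly formulated (Abnar and Zuidema, 2020), not the authors' implementation. It assumes each layer's attention is given as a `[heads][tokens][tokens]` nested list with rows summing to 1, that heads are averaged, and that the residual connection is modeled by mixing in the identity with equal weight. The function name `attention_rollout` and the toy inputs are hypothetical.

```python
def attention_rollout(layer_attentions):
    """Propagate attention through layers (sketch of Attention Rollout).

    layer_attentions: list over layers; each entry is [heads][tokens][tokens],
    where row i holds the attention that token i pays to every token.
    Returns a [tokens][tokens] matrix of rolled-out attention.
    """
    n = len(layer_attentions[0][0])
    # Start from the identity: before any layer, each token only "sees" itself.
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for heads in layer_attentions:
        num_heads = len(heads)
        # Average the attention maps over heads (an assumption of this sketch).
        avg = [[sum(h[i][j] for h in heads) / num_heads for j in range(n)]
               for i in range(n)]
        # Model the residual connection by mixing in the identity matrix.
        aug = [[0.5 * avg[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
        # Re-normalize rows so each row remains a distribution.
        aug = [[v / sum(row) for v in row] for row in aug]
        # Compose with the lower layers: rollout = aug @ rollout.
        rollout = [[sum(aug[i][k] * rollout[k][j] for k in range(n))
                    for j in range(n)] for i in range(n)]
    return rollout


# Toy example: two layers, one head each, three time steps (made-up values).
toy_layers = [
    [[[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]]],
    [[[0.3, 0.3, 0.4], [0.0, 1.0, 0.0], [0.5, 0.25, 0.25]]],
]
toy_rollout = attention_rollout(toy_layers)
```

In a temporal setting, one row of the result (for example, the row of a token used for prediction) can be read as an attribution over input time steps. How the paper pools or selects rows is not stated in this summary.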
Abstract
Multimodal learning enables various machine learning tasks to benefit from diverse data sources, effectively mimicking the interplay of different factors in real-world applications, particularly in agriculture. While the heterogeneous nature of involved data modalities may necessitate the design of complex architectures, the model interpretability is often overlooked. In this study, we leverage the intrinsic explainability of Transformer-based models to explain multimodal learning networks, focusing on the task of crop yield prediction at the subfield level. The large datasets used cover various crops, regions, and years, and include four different input modalities: multispectral satellite and weather time series, terrain elevation maps and soil properties. Based on the self-attention mechanism, we estimate feature attributions using two methods, namely the Attention Rollout (AR) and…
Peer Reviews
Decision · ICLR 2025 Conference Withdrawn Submission
The model development and analysis are thorough.
The paper lacks a clear novel contribution, as it primarily involves constructing a model for crop prediction and testing different model architectures and feature engineering methods. The authors identify a transformer-based architecture as the best-performing model and use various XAI methods, such as linear probes across layers and attention weights attribution, to investigate the relationship between internal representations and predictions based on data modality. The experiments appear som
* This paper analyzes a wide range of approaches through detailed ablation studies that compare various model architectures. Layer-wise analysis using linear probes provides insight into how learned representations evolve across the model's depth. Finally, the paper explores attention weight distributions and compares multiple attribution methods.
* Evaluation on real-world data for three different crops results in dir
* The choice of baseline models can be improved. For example, ConvLSTM [1], STATT [2], and 3D-CNN [3] seem more suitable for satellite image time-series data.
* The focus on a dataset from Argentina raises questions about generalizability to other regions and crop types. Including experiments on datasets from various geographic regions would strengthen the model's applicability and reliability.
* It would be better if the interpretability results were accompanied by hypotheses/explanations derived fro
- Within the crop yield prediction literature, this work's focus on explainability may be novel.
- Comparative experiments and analyses have been conducted extensively.
- The importance of the crop yield prediction task is not adequately stated, so the usefulness and impact of the analysis results are not conveyed.
- The technical contributions of this research are unclear. Although the data may be unique, the prediction model and explanation methods are existing techniques that appear to be merely applied to crop yield prediction.
- The paper introduces an original approach to crop yield prediction using Transformer-based models, a relatively novel application in this field. Its emphasis on intrinsic interpretability in multimodal learning stands out, offering a fresh alternative to the commonly used post-hoc methods in agricultural and environmental modeling.
- The methodology is robust, featuring extensive experimentation across multiple neural network architectures (e.g., LSTM, CNN, and Transformer) to identify the mo
- Although the study compares Transformer models with LSTM, ALSTM, and CNN architectures, including more contemporary multimodal learning approaches could further enhance the analysis. For instance, comparisons with models such as Multimodal Variational Autoencoders or Multimodal Contrastive Learning frameworks would provide a more comprehensive evaluation of the Transformer's effectiveness relative to recent innovations in multimodal integration.
- The Transformer model, with its 109,345 param
- This paper is well structured and understandable.
- It refers thoroughly to related papers, both to establish existing approaches and the problem's significance and to corroborate the results of the study.
- Each interpretability evaluation provides visualisations, which can be leveraged to suggest either particularities of the model architecture or qualities of the data that might influence the model.
- The dataset used is very large, and authors make sure that the analysis benefit
- Only one dataset is used, with no variation in modalities or in indicators within each modality. Even within the dataset, the authors do not truly discuss concept drift between the different fields, despite observing variability. Time-wise drift is not evaluated, even though the data span 2017 to 2023. How well the approach generalizes to other regions and future years is unknown.
- Expert understanding is missing from the evaluation, hence several of the explanations that the auth
Taxonomy
Topics: Smart Agriculture and AI · Remote Sensing in Agriculture · Explainable Artificial Intelligence (XAI)
