Intrinsic Explainability of Multimodal Learning for Crop Yield Prediction

Hiba Najjar; Deepak Pathak; Marlon Nuske; Andreas Dengel

arXiv:2508.06939·cs.AI·August 12, 2025

Open Access · 5 Reviews

TL;DR

This paper uses the intrinsic explainability of Transformer models to interpret multimodal crop yield prediction. It reports that Transformers outperform CNN and RNN baselines, and it estimates feature and modality contributions with attention-based attribution methods.

Contribution

It introduces the use of Transformer-based models for interpretable multimodal crop yield prediction and proposes the Weighted Modality Activation method for modality attribution.

Findings

01. Transformers outperform CNNs and RNNs in yield prediction accuracy.

02. Attention Rollout provides more reliable temporal attributions.

03. Modality attributions vary across methods and are interpretable with agronomic knowledge.

Abstract

Multimodal learning enables various machine learning tasks to benefit from diverse data sources, effectively mimicking the interplay of different factors in real-world applications, particularly in agriculture. While the heterogeneous nature of involved data modalities may necessitate the design of complex architectures, the model interpretability is often overlooked. In this study, we leverage the intrinsic explainability of Transformer-based models to explain multimodal learning networks, focusing on the task of crop yield prediction at the subfield level. The large datasets used cover various crops, regions, and years, and include four different input modalities: multispectral satellite and weather time series, terrain elevation maps and soil properties. Based on the self-attention mechanism, we estimate feature attributions using two methods, namely the Attention Rollout (AR) and…
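For readers unfamiliar with Attention Rollout, the following is a minimal sketch of its commonly used formulation: average the attention heads in each layer, mix in the identity matrix to account for residual connections, renormalize the rows, and multiply the resulting matrices across layers. It is not the authors' implementation. The function names, the random stand-in attention maps, and the final step of averaging over output positions to obtain one score per time step are all assumptions made for illustration.

```python
import numpy as np


def attention_rollout(attentions):
    """Roll attention out across layers.

    attentions: list of arrays, one per layer (first layer first), each of
    shape (heads, tokens, tokens) with rows that sum to 1.
    Returns a (tokens, tokens) matrix whose row i estimates how much each
    input token contributes to output position i.
    """
    n_tokens = attentions[0].shape[-1]
    identity = np.eye(n_tokens)
    rollout = identity.copy()
    for layer_attn in attentions:
        a = layer_attn.mean(axis=0)             # average over heads
        a = 0.5 * a + 0.5 * identity            # account for the residual path
        a = a / a.sum(axis=-1, keepdims=True)   # keep rows normalized
        rollout = a @ rollout                   # compose with earlier layers
    return rollout


def random_attention(rng, heads, tokens):
    """Hypothetical stand-in for attention maps taken from a trained model."""
    logits = rng.normal(size=(heads, tokens, tokens))
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
# Assumed toy setup: 3 layers, 4 heads, 6 time steps as tokens.
layers = [random_attention(rng, heads=4, tokens=6) for _ in range(3)]
rollout = attention_rollout(layers)

# Assumption: average over output positions to get one score per time step.
temporal_attribution = rollout.mean(axis=0)
print(temporal_attribution)
```

Because every per-layer matrix is row-normalized, the rolled-out matrix is too, so the per-time-step scores sum to 1 and can be read as a distribution over input time steps.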

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 3

Strengths

The model development and analysis are thorough.

Weaknesses

The paper lacks a clear novel contribution, as it primarily involves constructing a model for crop prediction and testing different model architectures and feature engineering methods. The authors identify a transformer-based architecture as the best-performing model and use various XAI methods, such as linear probes across layers and attention weights attribution, to investigate the relationship between internal representations and predictions based on data modality. The experiments appear som

Reviewer 02 · Rating 3 · Confidence 4

Strengths

* The paper analyzes a wide range of approaches, using detailed ablation studies to compare model architectures. Layer-wise analysis with linear probes provides insight into how learned representations evolve across the model's depth. The paper also explores attention weight distributions and compares multiple attribution methods.
* Evaluation on real-world data for three different crops results in dir

Weaknesses

* The choice of baseline models can be improved. For example, ConvLSTM [1], STATT [2], and 3D-CNN [3] seem more suitable for satellite image time-series data.
* The focus on a dataset from Argentina raises questions about generalizability to other regions and crop types. Including experiments on datasets from various geographic regions would strengthen the model's applicability and reliability.
* It would be better if the interpretability results were accompanied by hypotheses/explanations derived fro

Reviewer 03 · Rating 3 · Confidence 3

Strengths

- Within the crop yield prediction literature, the focus on explainability may be novel.
- The comparative experiments and analyses are extensive.

Weaknesses

- The importance of the crop yield prediction task is not adequately stated. Therefore, the usefulness and impact of the analysis results are not conveyed.
- The technical contributions of this research are unclear. Although the data may be unique, the prediction model and explanation methods used are existing techniques and appear to be merely applied to crop yield prediction.

Reviewer 04 · Rating 3 · Confidence 4

Strengths

- The paper introduces an original approach to crop yield prediction using Transformer-based models, a relatively novel application in this field. Its emphasis on intrinsic interpretability in multimodal learning stands out, offering a fresh alternative to the commonly used post-hoc methods in agricultural and environmental modeling.
- The methodology is robust, featuring extensive experimentation across multiple neural network architectures (e.g., LSTM, CNN, and Transformer) to identify the mo

Weaknesses

- Although the study compares Transformer models with LSTM, ALSTM, and CNN architectures, including more contemporary multimodal learning approaches could further enhance the analysis. For instance, comparisons with models such as Multimodal Variational Autoencoders or Multimodal Contrastive Learning frameworks would provide a more comprehensive evaluation of the Transformer's effectiveness relative to recent innovations in multimodal integration.
- The Transformer model, with its 109,345 param

Reviewer 05 · Rating 6 · Confidence 3

Strengths

- The paper is well structured and understandable.
- It refers thoroughly to related papers, both to establish existing approaches and the significance of the problem, and to corroborate the results of the study.
- Each interpretability evaluation provides visualisations, which can be leveraged to suggest either particularities of the model architecture or qualities of the data that might influence the model.
- The dataset used is very large, and authors make sure that the analysis benefit

Weaknesses

- Only one dataset is used, with no variation in modalities or in the indicators within each modality. Even within the dataset, the authors do not truly discuss concept drift between the different fields, despite observing variability. Temporal drift is not evaluated, even though the data span 2017 to 2023. How well the approach generalizes to other regions and future years is unknown.
- Expert understanding is missing from the evaluation, hence several of the explanations that the auth

Taxonomy

Topics: Smart Agriculture and AI · Remote Sensing in Agriculture · Explainable Artificial Intelligence (XAI)