# Robust Fall Army Worm detection in maize using multimodal RGB and thermal image fusion

**Authors:** Prakash Sandhya, B Venkataramana

PMC · DOI: 10.1038/s41598-025-29784-8 · Scientific Reports · 2025-12-25

## TL;DR

This paper introduces a deep learning framework that combines RGB and thermal images to accurately detect Fall Army Worm infestations in maize crops.

## Contribution

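The paper proposes a hybrid DNN-ViT model that fuses RGB and thermal images at two levels to improve FAW detection in maize: feature-level fusion (CNN-extracted features classified by a DNN) and image-level fusion (a 6-channel RGB-thermal image processed by a modified ViT).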

## Key findings

- The fused model achieved 0.98 for accuracy, precision, recall, F1-score, and AUC-ROC on the test set.
- In the ablation study, the no-fusion model scored far lower (accuracy 0.60, AUC-ROC 0.67) than the fused model.
- RGB-thermal fusion outperformed models using only RGB or thermal data.
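
For readers who want to see how the threshold-based metrics above relate to one another, here is a minimal sketch using the standard definitions for a binary classifier (infested vs. healthy). The function name `binary_metrics` and the confusion-matrix counts are hypothetical and are not taken from the paper. AUC-ROC is not included because it is computed from ranked scores rather than from a single confusion matrix.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    tp/fp/fn/tn are counts for the positive class (here: FAW infested).
    Ratios with an empty denominator are reported as 0.0.
    """
    def safe_div(num, den):
        return num / den if den else 0.0

    accuracy = safe_div(tp + tn, tp + fp + fn + tn)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts chosen for illustration only (not the paper's results).
metrics = binary_metrics(tp=45, fp=5, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

When false positives and false negatives are balanced, as in these illustrative counts, precision, recall, and F1 coincide, which is one way a model can report the same value for all of them.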

## Abstract

Effective pest and disease detection plays a crucial role in minimizing crop losses and improving decision-making in precision agriculture. Among the most destructive pests affecting maize crops globally is the Fall Army Worm (FAW), known for its rapid spread and high impact on yield. Existing detection practices often rely on manual scouting, which can be inefficient, labour-intensive and prone to human error. This study proposes a novel deep learning-based framework for the automatic classification of FAW-infested and healthy maize crops by integrating RGB and thermal image modalities. The core objective is to enhance detection accuracy through multimodal image fusion. A hybrid DNN-ViT model is introduced, combining two complementary pipelines: (i) feature-level fusion, where CNN-extracted features from RGB and thermal images are fused and classified using a Deep Neural Network (DNN), and (ii) image-level fusion, where a 6-channel RGB-thermal image is directly processed using a modified Vision Transformer (ViT). Experimental results demonstrate that the fused model achieved superior performance, with an accuracy of 0.98, precision, recall and F1-score of 0.98, and AUC-ROC of 0.98 on the test set, outperforming models trained on RGB-only, thermal-only and unfused data. The ablation study confirms the effectiveness of multimodal fusion, with the no-fusion model showing significantly lower performance (accuracy 0.60 and AUC-ROC 0.67). This work highlights the benefits of integrating complementary data sources for robust crop health monitoring. Future research will explore enhanced fusion strategies, environmental robustness and field-level deployment to validate the model’s practical applicability.
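
The two fusion levels named in the abstract can be illustrated at the tensor-shape level. The following is a minimal sketch, not the authors' implementation. It assumes the thermal image is stored as a 3-channel (for example pseudo-colour) array aligned with the RGB image, so that stacking gives 6 channels, and it assumes plain concatenation as the fusion operator. The function names, the 224 x 224 image size, and the 128-dimensional feature vectors are hypothetical, and random arrays stand in for real images and CNN features.

```python
import numpy as np


def image_level_fusion(rgb, thermal):
    """Stack two aligned H x W x 3 images into one H x W x 6 array.

    Assumption: the thermal frame is already registered to the RGB frame
    and stored with 3 channels.
    """
    if rgb.shape != thermal.shape or rgb.shape[-1] != 3:
        raise ValueError("expected two aligned H x W x 3 images")
    return np.concatenate([rgb, thermal], axis=-1)


def feature_level_fusion(rgb_features, thermal_features):
    """Concatenate per-image feature vectors from the two modalities.

    Assumption: concatenation is used as the fusion operator; the fused
    vector would then be passed to a classifier such as a DNN.
    """
    if rgb_features.shape[0] != thermal_features.shape[0]:
        raise ValueError("both modalities must cover the same images")
    return np.concatenate([rgb_features, thermal_features], axis=-1)


rng = np.random.default_rng(0)

# Image-level fusion: placeholder images stand in for an RGB/thermal pair.
rgb = rng.random((224, 224, 3), dtype=np.float32)
thermal = rng.random((224, 224, 3), dtype=np.float32)
fused_image = image_level_fusion(rgb, thermal)
print(fused_image.shape)  # (224, 224, 6)

# Feature-level fusion: placeholder vectors stand in for CNN features
# of a batch of 4 images, 128 dimensions per modality.
rgb_feat = rng.random((4, 128), dtype=np.float32)
thermal_feat = rng.random((4, 128), dtype=np.float32)
fused_feat = feature_level_fusion(rgb_feat, thermal_feat)
print(fused_feat.shape)  # (4, 256)
```

In this sketch the image-level path changes only the channel count seen by the downstream model, while the feature-level path changes the width of the vector given to the classifier. How the paper's modified ViT accepts 6 channels is not described in this summary.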

The online version contains supplementary material available at 10.1038/s41598-025-29784-8.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12775024/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12775024/full.md

## References

12 references — full list in the complete paper: https://tomesphere.com/paper/PMC12775024/full.md

---
Source: https://tomesphere.com/paper/PMC12775024