A Two-Stage Multitask Vision-Language Framework for Explainable Crop Disease Visual Question Answering

Md. Zahid Hossain; Most. Sharmin Sultana Samu; Md. Rakibul Islam; Md. Siam Ansary

arXiv:2601.05143 · cs.CV · March 10, 2026

TL;DR

This paper introduces a lightweight, explainable two-stage vision-language framework for crop disease VQA. It reports 99.94% plant classification accuracy, 99.06% disease classification accuracy, and 83.18% accuracy on an external VQA benchmark, which makes it a candidate for practical agricultural applications.

Contribution

The work presents a two-stage training strategy: the vision encoder is first trained on multitask plant and disease classification, then frozen while the text decoders are trained. The authors report that this improves crop disease VQA performance and interpretability.

Findings

01 Achieved 99.94% plant classification accuracy

02 Achieved 99.06% disease classification accuracy

03 Generalized well to an external VQA benchmark, with 83.18% accuracy

Abstract

Visual question answering (VQA) for crop disease analysis requires accurate visual understanding and reliable language generation. In this work, we present a lightweight and explainable vision-language framework for crop and disease identification from leaf images. The proposed approach integrates a Swin Transformer vision encoder with sequence-to-sequence language decoders. The vision encoder is first trained in a multitask setup for both plant and disease classification, and then frozen while the text decoders are trained, forming a two-stage training strategy that enhances visual representation learning and cross-modal alignment. We evaluate the model on the large-scale Crop Disease Domain Multimodal (CDDM) dataset using both classification and natural language generation metrics. Experimental results demonstrate near-perfect recognition performance, achieving 99.94% plant…
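
The two-stage strategy described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes PyTorch, replaces the Swin Transformer with a tiny convolutional stand-in so it runs quickly, and replaces the sequence-to-sequence decoders with a single linear placeholder. All class names, dimensions, class counts, and the unweighted sum of the two classification losses are hypothetical choices made for the example.

```python
import torch
from torch import nn


class MultitaskEncoder(nn.Module):
    """Stand-in vision encoder with separate plant and disease heads."""

    def __init__(self, feat_dim=32, num_plants=4, num_diseases=6):
        super().__init__()
        # Tiny conv backbone (assumption); the paper uses a Swin Transformer.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(8, feat_dim),
        )
        self.plant_head = nn.Linear(feat_dim, num_plants)
        self.disease_head = nn.Linear(feat_dim, num_diseases)

    def forward(self, images):
        feats = self.backbone(images)
        return feats, self.plant_head(feats), self.disease_head(feats)


class TextDecoder(nn.Module):
    """Placeholder decoder (assumption): maps image features to token logits."""

    def __init__(self, feat_dim=32, vocab_size=50, answer_len=5):
        super().__init__()
        self.answer_len, self.vocab_size = answer_len, vocab_size
        self.proj = nn.Linear(feat_dim, answer_len * vocab_size)

    def forward(self, feats):
        return self.proj(feats).view(-1, self.answer_len, self.vocab_size)


# Synthetic batch, for illustration only.
torch.manual_seed(0)
images = torch.randn(8, 3, 32, 32)
plant_y = torch.randint(0, 4, (8,))
disease_y = torch.randint(0, 6, (8,))
answer_y = torch.randint(0, 50, (8, 5))

encoder, decoder = MultitaskEncoder(), TextDecoder()
ce = nn.CrossEntropyLoss()

# Stage 1: train the encoder on both classification tasks at once.
opt1 = torch.optim.AdamW(encoder.parameters(), lr=1e-3)
_, plant_logits, disease_logits = encoder(images)
stage1_loss = ce(plant_logits, plant_y) + ce(disease_logits, disease_y)
opt1.zero_grad()
stage1_loss.backward()
opt1.step()

# Stage 2: freeze the encoder and train only the decoder.
encoder.requires_grad_(False)
encoder.eval()
frozen_before = [p.clone() for p in encoder.parameters()]
opt2 = torch.optim.AdamW(decoder.parameters(), lr=1e-3)
with torch.no_grad():
    feats, _, _ = encoder(images)
token_logits = decoder(feats)
stage2_loss = ce(token_logits.reshape(-1, 50), answer_y.reshape(-1))
opt2.zero_grad()
stage2_loss.backward()
opt2.step()

# The encoder weights should be identical before and after the stage 2 step.
encoder_unchanged = all(
    torch.equal(a, b) for a, b in zip(frozen_before, encoder.parameters())
)
```

Freezing the encoder in the second stage means the decoder learns against fixed visual features, so the classification behavior obtained in stage 1 cannot drift while the language side is trained.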

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning