Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry

Syed Nazmus Sakib; Nafiul Haque; Shahrear Bin Amin; Hasan Muhammad Abdullah; Md. Mehedi Hasan; Mohammad Zabed Hossain; Shifat E. Arman

arXiv:2604.20983·cs.CV·April 24, 2026

TL;DR

This paper introduces PlantInquiryVQA, a new benchmark for multi-step, intent-driven visual reasoning in botanical diagnosis, highlighting the limitations of current models and the benefits of structured inquiry.

Contribution

It formalizes a Chain of Inquiry framework, provides a large expert-annotated dataset, and demonstrates improved diagnostic reasoning with structured question-guided inquiry.

Findings

1. Multimodal models describe symptoms but struggle with clinical reasoning.

2. Structured inquiry improves diagnostic accuracy and reduces hallucinations.

3. The dataset contains 24,950 images and 138,068 QA pairs with visual grounding.

Abstract

Vision evaluations are typically done through multi-step processes. In most contemporary fields, experts analyze images using structured, evidence-based adaptive questioning. In plant pathology, botanists inspect leaf images, identify visual cues, infer diagnostic intent, and probe further with targeted questions that adapt to species, symptoms, and severity. This structured probing is crucial for accurate disease diagnosis and treatment formulation. Yet current vision-language models are evaluated on single-turn question answering. To address this gap, we introduce PlantInquiryVQA, a benchmark for studying multi-step, intent-driven visual reasoning in botanical diagnosis. We formalize a Chain of Inquiry framework modeling diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent. We release a dataset of 24,950…
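
The abstract describes a diagnostic trajectory as an ordered question-answer sequence in which each turn is conditioned on a grounded visual cue and an explicit epistemic intent. The following is a minimal sketch of how such a trajectory might be represented in code. The class names (InquiryStep, ChainOfInquiry), the field names, and the example values are all hypothetical illustrations and should not be read as the schema of PlantInquiryVQA.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class InquiryStep:
    """One turn in a diagnostic trajectory (hypothetical field names)."""
    intent: str      # explicit epistemic intent behind the question
    visual_cue: str  # grounded visual evidence the question is conditioned on
    question: str
    answer: str


@dataclass
class ChainOfInquiry:
    """An ordered question-answer sequence for a single leaf image."""
    image_id: str
    steps: List[InquiryStep] = field(default_factory=list)

    def add_step(self, step: InquiryStep) -> None:
        # Order matters: later questions adapt to earlier answers.
        self.steps.append(step)

    def history(self, upto: int) -> List[Tuple[str, str]]:
        """Return the (question, answer) pairs that precede step index `upto`."""
        return [(s.question, s.answer) for s in self.steps[:upto]]

    def intents(self) -> List[str]:
        """Return the sequence of intents in the order they were asked."""
        return [s.intent for s in self.steps]


# Illustrative values only; they are not taken from the dataset.
chain = ChainOfInquiry(image_id="leaf_0001")
chain.add_step(InquiryStep(
    intent="identify_symptom",
    visual_cue="brown circular lesions near the leaf margin",
    question="What visible symptom is present on the leaf?",
    answer="Brown circular lesions.",
))
chain.add_step(InquiryStep(
    intent="assess_severity",
    visual_cue="lesions cover a small part of the leaf surface",
    question="How severe is the damage?",
    answer="Mild.",
))
```

In this sketch, a model answering the second question would receive `chain.history(1)` as context, which reflects the multi-step setting the benchmark targets, as opposed to single-turn question answering.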

Code & Models

Datasets

SyedNazmusSakib/PlantInquiryVQA
dataset · 4.8k downloads
