Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry
Syed Nazmus Sakib, Nafiul Haque, Shahrear Bin Amin, Hasan Muhammad Abdullah, Md. Mehedi Hasan, Mohammad Zabed Hossain, Shifat E. Arman

TL;DR
This paper introduces PlantInquiryVQA, a new benchmark for multi-step, intent-driven visual reasoning in botanical diagnosis, highlighting the limitations of current models and the benefits of structured inquiry.
Contribution
It formalizes a Chain of Inquiry framework, provides a large expert-annotated dataset, and demonstrates improved diagnostic reasoning with structured question-guided inquiry.
Findings
Multimodal models describe symptoms but struggle with clinical reasoning.
Structured inquiry improves diagnostic accuracy and reduces hallucinations.
The dataset contains 24,950 images and 138,068 QA pairs with visual grounding (about 5.5 QA pairs per image on average).
Abstract
Vision evaluations are typically done through multi-step processes. In most contemporary fields, experts analyze images using structured, evidence-based adaptive questioning. In plant pathology, botanists inspect leaf images, identify visual cues, infer diagnostic intent, and probe further with targeted questions that adapt to species, symptoms, and severity. This structured probing is crucial for accurate disease diagnosis and treatment formulation. Yet current vision-language models are evaluated on single-turn question answering. To address this gap, we introduce PlantInquiryVQA, a benchmark for studying multi-step, intent-driven visual reasoning in botanical diagnosis. We formalize a Chain of Inquiry framework modeling diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent. We release a dataset of 24,950…
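The abstract describes a diagnostic trajectory as an ordered sequence of question-answer pairs, each tied to grounded visual cues and an explicit intent. One way to picture that structure is the minimal Python sketch below. It is an illustration only, not the authors' released schema: the class names (InquiryStep, InquiryChain), the field names, the intent labels, and the sample leaf content are all hypothetical and invented for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class InquiryStep:
    """One turn in a diagnostic trajectory (hypothetical schema)."""
    intent: str                 # assumed label for the epistemic intent of the question
    question: str
    answer: str
    visual_cues: List[str] = field(default_factory=list)  # cues the turn is grounded in


@dataclass
class InquiryChain:
    """Ordered question-answer sequence for a single image (hypothetical schema)."""
    image_id: str
    steps: List[InquiryStep] = field(default_factory=list)

    def add_step(self, step: InquiryStep) -> None:
        # Order matters: later questions are conditioned on earlier answers.
        self.steps.append(step)

    def history(self, upto: int) -> List[Tuple[str, str]]:
        # (question, answer) pairs that precede the step at index `upto`.
        return [(s.question, s.answer) for s in self.steps[:upto]]

    def intents(self) -> List[str]:
        return [s.intent for s in self.steps]


# Invented sample content, for illustration only (not taken from the dataset).
chain = InquiryChain(image_id="example_leaf_001")
chain.add_step(InquiryStep("identify_species", "Which plant is shown?",
                           "Tomato", ["leaf shape"]))
chain.add_step(InquiryStep("describe_symptom", "What symptoms are visible?",
                           "Brown concentric spots", ["brown spots"]))
chain.add_step(InquiryStep("assess_severity", "How severe is the damage?",
                           "Moderate", ["spot coverage"]))

print(chain.intents())
print(chain.history(2))
```

In a multi-turn evaluation, the output of history(k) is the context a model would see before answering question k, which is what separates this setup from single-turn question answering.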
