From UAV Imagery to Agronomic Reasoning: A Multimodal LLM Benchmark for Plant Phenotyping

Yu Wu; Guangzeng Han; Ibra Niang Niang; Francia Ravelombola; Maiara Oliveira; Jason Davis; Dong Chen; Feng Lin; Xiaolei Huang

arXiv:2604.09907 · cs.CV · April 14, 2026

TL;DR

This paper introduces PlantXpert, a multimodal reasoning benchmark for plant phenotyping that evaluates vision-language models on soybean and cotton, with a focus on domain-specific knowledge and complex agronomic reasoning.

Contribution

The paper provides a structured dataset and evaluation framework for adapting multimodal models to plant science and for comparing base models against their domain-adapted counterparts, highlighting current capabilities and open challenges.

Findings

01. Fine-tuning improves model accuracy significantly.

02. Scaling models beyond a point yields diminishing returns.

03. Quantitative and biological reasoning remain challenging.

Abstract

To improve crop genetics, high-throughput, effective and comprehensive phenotyping is a critical prerequisite. While such tasks were traditionally performed manually, recent advances in multimodal foundation models, especially in vision-language models (VLMs), have enabled more automated and robust phenotypic analysis. However, plant science remains a particularly challenging domain for foundation models because it requires domain-specific knowledge, fine-grained visual interpretation, and complex biological and agronomic reasoning. To address this gap, we develop PlantXpert, an evidence-grounded multimodal reasoning benchmark for soybean and cotton phenotyping. Our benchmark provides a structured and reproducible framework for agronomic adaptation of VLMs, and enables controlled comparison between base models and their domain-adapted counterparts. We constructed a dataset comprising…
