Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind

Qingmei Li; Yang Zhang; Zurong Mai; Yuhang Chen; Shuohong Lou; Henglian Huang; Jiarui Zhang; Zhiwei Zhang; Yibin Wen; Weijia Li; Haohuan Fu; Jianxi Huang; Juepeng Zheng

arXiv:2505.12207·cs.CV·August 14, 2025

TL;DR

This paper introduces AgroMind, a comprehensive benchmark for evaluating large multimodal models in agricultural remote sensing, highlighting current limitations and guiding future improvements in domain-specific understanding.

Contribution

We present AgroMind, a new benchmark with diverse tasks and datasets for assessing LMMs in agricultural remote sensing, addressing previous dataset limitations and providing a standardized evaluation framework.

Findings

01 LMMs show significant gaps in spatial reasoning and fine-grained recognition.

02 Several LMMs surpass human performance on some tasks.

03 Current models have notable limitations in domain-specific agricultural understanding.

Abstract

Large Multimodal Models (LMMs) have demonstrated capabilities across various domains, but comprehensive benchmarks for agricultural remote sensing (RS) remain scarce. Existing benchmarks designed for agricultural RS scenarios exhibit notable limitations, primarily insufficient scene diversity in the datasets and oversimplified task design. To bridge this gap, we introduce AgroMind, a comprehensive agricultural remote sensing benchmark covering four task dimensions: spatial perception, object understanding, scene understanding, and scene reasoning, with a total of 13 task types, ranging from crop identification and health monitoring to environmental analysis. We curate a high-quality evaluation set by integrating eight public datasets and one private farmland plot dataset, containing 27,247 QA pairs and 19,615 images. The pipeline begins with multi-source data pre-processing,…
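
As a rough illustration of how a benchmark organized into these four task dimensions might be scored, the sketch below groups QA pairs by dimension and reports per-dimension accuracy. The `QAPair` record layout, the field names, and the case-insensitive exact-match scoring rule are all assumptions made for this example; they are not taken from the AgroMind release.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four task dimensions named in the abstract.
DIMENSIONS = (
    "spatial perception",
    "object understanding",
    "scene understanding",
    "scene reasoning",
)


@dataclass(frozen=True)
class QAPair:
    # Hypothetical record layout; field names are illustrative only.
    image_id: str
    dimension: str
    question: str
    answer: str


def accuracy_by_dimension(pairs, predictions):
    """Return {dimension: fraction of predictions matching the reference}.

    Scoring is case-insensitive exact match after trimming whitespace,
    which is an assumption of this sketch.
    """
    if len(pairs) != len(predictions):
        raise ValueError("pairs and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for pair, predicted in zip(pairs, predictions):
        if pair.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {pair.dimension!r}")
        total[pair.dimension] += 1
        if predicted.strip().lower() == pair.answer.strip().lower():
            correct[pair.dimension] += 1
    # Only dimensions that actually appear in the input are reported.
    return {dim: correct[dim] / total[dim] for dim in total}
```

Reporting accuracy per dimension rather than as one pooled number keeps a weak dimension, such as spatial reasoning, from being masked by stronger ones.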

Taxonomy

Topics: Smart Agriculture and AI · Remote Sensing in Agriculture · Multimodal Machine Learning Applications

Methods: Sparse Evolutionary Training