TL;DR
AgroTools is a benchmark for evaluating tool-augmented multimodal agents in agriculture. It scores both process-level tool use and outcome-level task success, and it shows that current models are unreliable on these tasks.
Contribution
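Introduces AgroTools, a benchmark of 539 question-answer instances paired with 1,097 agricultural images across five task families, with structured tool-use annotations and an executable environment of 14 tools for assessing tool-augmented multimodal agricultural agents.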
Findings
Current models are unreliable in agricultural tool-use tasks.
The main bottlenecks are tool planning and execution recovery.
The benchmark and evaluation code are available on Hugging Face.
Abstract
Agricultural decision-making increasingly requires multimodal systems that can transform visual observations into reliable, executable actions. However, existing agricultural multimodal benchmarks mainly evaluate final-answer correctness and provide limited support for assessing whether models can use external tools to complete precision-sensitive workflows. In this paper, we introduce AgroTools, a benchmark for evaluating tool-augmented multimodal agents in agriculture. AgroTools contains 539 question-answer instances paired with 1,097 heterogeneous agricultural images, spanning five task families and an executable environment of 14 agricultural tools. Each query is annotated with structured tool-use traces, enabling a dual-view evaluation of both process-level execution quality and outcome-level task success. We benchmark 9 open-source and 4 closed-source multimodal large language…
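The dual-view evaluation described above can be illustrated with a minimal sketch. The abstract as shown here does not give the actual AgroTools metrics, so everything below is an assumption made for illustration: the `ToolCall` structure, the tool names, the ordered-overlap process score, and the normalized exact-match outcome score are all hypothetical and are not the benchmark's API.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One step of a tool-use trace (hypothetical structure)."""
    name: str
    args: dict = field(default_factory=dict)


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def process_score(predicted, reference):
    """Process view: share of the reference tool sequence recovered in order.

    Assumed metric for illustration, not the one used by AgroTools.
    """
    if not reference:
        return 1.0 if not predicted else 0.0
    pred_names = [call.name for call in predicted]
    ref_names = [call.name for call in reference]
    return lcs_length(pred_names, ref_names) / len(ref_names)


def outcome_score(predicted_answer, reference_answer):
    """Outcome view: case- and whitespace-insensitive exact match (assumed)."""
    return float(predicted_answer.strip().lower() == reference_answer.strip().lower())


# Hypothetical annotated trace and a model trace that skips one step.
reference = [ToolCall("detect_pest"), ToolCall("estimate_area"), ToolCall("recommend_dose")]
predicted = [ToolCall("detect_pest"), ToolCall("recommend_dose")]

print(process_score(predicted, reference))   # share of reference steps matched in order
print(outcome_score("12.5 mL", "12.5 ml"))   # final-answer match
```

Keeping the two scores separate is what makes the evaluation dual-view: a run can reach the right final answer while skipping an annotated tool step, or follow the annotated trace and still produce a wrong answer.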
