Are vision-language models ready to zero-shot replace supervised classification models in agriculture?
Earl Ranario, Mason J. Earles

TL;DR
This study benchmarks open-source and closed-source vision-language models (VLMs) on agricultural image classification tasks and finds that zero-shot VLMs underperform a supervised task-specific baseline, so they are not yet ready for standalone use.
Contribution
It provides a comprehensive evaluation of open-source and closed-source VLMs on diverse agricultural datasets, highlighting their limitations and potential as assistive tools.
Findings
Zero-shot VLMs underperform supervised baselines like YOLO11.
The best VLM (Gemini-3 Pro) reaches approximately 62% average accuracy with multiple-choice prompts.
Open-ended prompting yields much lower performance, with raw accuracies typically below 25%.
Abstract
Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural image classification datasets from the AgML collection (https://github.com/Project-AgML), spanning 162 classes and 248,000 images across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%.…
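The two prompting styles compared in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the prompt wording, the letter-based answer parsing, the substring match used for open-ended replies, and all function names are assumptions made for illustration. No model is called; the functions only build prompts and score replies that a VLM might return.

```python
import string


def build_multiple_choice_prompt(class_names):
    """Return a prompt that lists every candidate class as a lettered option.

    Hypothetical wording; limited to 26 classes because options are lettered A-Z.
    """
    if len(class_names) > len(string.ascii_uppercase):
        raise ValueError("this sketch supports at most 26 classes")
    options = "\n".join(
        f"{string.ascii_uppercase[i]}. {name}" for i, name in enumerate(class_names)
    )
    return "Which class best describes this image? Answer with one letter.\n" + options


def build_open_ended_prompt():
    """Return a prompt that gives the model no candidate classes (hypothetical wording)."""
    return "What is shown in this image? Answer with a short label."


def parse_multiple_choice(response, class_names):
    """Map a reply such as 'B' or ' b.' back to a class name, or None if unparseable."""
    reply = response.strip().upper()
    if not reply:
        return None
    index = string.ascii_uppercase.find(reply[0])
    if 0 <= index < len(class_names):
        return class_names[index]
    return None


def open_ended_correct(response, true_label):
    """Assumed scoring rule: the true label must appear verbatim in the reply."""
    return true_label.lower() in response.lower()


def accuracy(flags):
    """Fraction of correct predictions; 0.0 for an empty list."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0
```

Under these assumptions, a multiple-choice reply is correct when `parse_multiple_choice(reply, class_names)` equals the ground-truth label, whereas an open-ended reply must name the label without seeing any candidates. A strict string match like the one above penalizes synonyms and paraphrases, which is one plausible reason raw open-ended accuracy can be much lower.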
