Are vision-language models ready to zero-shot replace supervised classification models in agriculture?

Earl Ranario; Mason J. Earles

arXiv:2512.15977·cs.CV·May 12, 2026

TL;DR

This study benchmarks open-source and closed-source vision-language models (VLMs) on 27 agricultural image classification datasets. Zero-shot VLMs substantially underperform a supervised, task-specific baseline (YOLO11) and are not yet ready for standalone use.

Contribution

The paper evaluates open-source and closed-source VLMs on 27 AgML datasets covering plant disease, pest and damage, and plant and weed species identification, highlighting both their limitations and their potential as assistive tools.

Findings

01

Across all tasks, zero-shot VLMs substantially underperform the supervised, task-specific baseline (YOLO11).

02

The best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy under multiple-choice prompting.

03

Open-ended prompting yields much lower performance, with raw accuracies typically below 25%.

Abstract

Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural image classification datasets from the AgML collection (https://github.com/Project-AgML), spanning 162 classes and 248,000 images across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%.…
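
To make the two prompting modes concrete, here is a minimal sketch of how multiple-choice and open-ended prompts might be built and scored. It is not the authors' protocol: the prompt wording, the exact-match normalization used for "raw" accuracy, the unweighted mean across datasets, and every function name below are assumptions made for illustration. The model call itself is left out.

```python
from typing import Mapping, Sequence


def build_multiple_choice_prompt(class_names: Sequence[str]) -> str:
    """Hypothetical prompt: the model sees every candidate class label."""
    options = "\n".join(f"{i + 1}. {name}" for i, name in enumerate(class_names))
    return (
        "Which of the following classes best describes the image?\n"
        f"{options}\n"
        "Answer with the class name only."
    )


def build_open_ended_prompt() -> str:
    """Hypothetical prompt: the model sees no candidate labels."""
    return "What is shown in this image? Answer with a short label."


def normalize(text: str) -> str:
    """Lowercase, treat underscores as spaces, and collapse whitespace."""
    return " ".join(text.lower().replace("_", " ").split())


def raw_accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the label after normalization.

    Assumption: 'raw' is read here as strict string matching with no
    synonym or semantic matching.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(normalize(p) == normalize(t) for p, t in zip(predictions, labels))
    return correct / len(labels)


def average_accuracy(per_dataset: Mapping[str, float]) -> float:
    """Unweighted mean of per-dataset accuracies (an assumed aggregation)."""
    if not per_dataset:
        raise ValueError("per_dataset must not be empty")
    return sum(per_dataset.values()) / len(per_dataset)
```

Under strict matching, an open-ended answer such as "rust on a wheat leaf" counts as wrong even when the label is "leaf_rust". This is one plausible reason why open-ended raw accuracy can fall well below multiple-choice accuracy, where the model only has to pick from the listed names.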
