# Vision-language models for zero-shot weed detection and visual reasoning in UAV-based precision agriculture

**Authors:** Muhammad Fahad Nasir, Mobeen Ur Rehman, Irfan Hussain

PMC · DOI: 10.3389/fpls.2025.1735096 · Frontiers in Plant Science · 2026-01-29

## TL;DR

This paper explores using vision-language models for weed detection in agriculture, showing that some models can work without prior training data and offer good interpretability.

## Contribution

The study introduces Error-Probing Prompting and evaluates multiple vision-language models for zero-shot weed detection in UAV imagery.

## Key findings

- Gemini Flash 2.5 shows the most consistent zero-shot performance and highest interpretability.
- ChatGPT-4.1 excels in reasoning but has lower raw detection accuracy.
- Interpretability correlates with spatial correctness in weed detection.

## Abstract

Weeds remain a major constraint to row-crop productivity, yet current deep learning approaches for UAV imagery often require extensive annotation, generalize poorly across fields, and provide limited interpretability. We investigate whether modern vision–language models (VLMs) can address these gaps in a zero-shot setting. Using drone images from soybean fields with ground-truth weed boxes, we evaluate six commercial VLMs, ChatGPT-4.1, ChatGPT-4o, Gemini Flash 2.5, Gemini Flash Lite 2.5, LLaMA-4 Scout, and LLaMA-4 Maverick under a unified prompt that elicits (i) weed presence, (ii) spatial localization, (iii) reasoning, (iv) crop growth stage, and (v) crop type. We further introduce Error-Probing Prompting (EPP), a counterfactual follow-up that forces re-analysis under the assumption that weeds are present, and we quantify self-correction with expert-rated interpretability scores (Grounding, Specificity, Plausibility, Non-Hallucination, Actionability). Across models, Gemini Flash 2.5 delivers the most consistent zero-shot performance and highest interpretability, ChatGPT-4.1 provides the strongest reasoning but lower raw detection, ChatGPT-4o offers a balanced profile, and LLaMA-4 variants lag in localization and specificity. Gemini Flash Lite 2.5 is efficient but fails EPP stress tests, revealing brittle reasoning. Visual grounding analysis and a text-to-region overlap metric show that interpretability tracks spatial correctness. Results highlight that explainability and feedback driven adaptability not scale alone best predict reliability for field deployment, and position VLMs as promising, low-annotation tools for precision weed management.

## Full-text entities

- **Chemicals:** Gemini Flash Lite (-)
- **Species:** Glycine max (soybean, species) [taxon 3847]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12894358/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12894358/full.md

## References

73 references — full list in the complete paper: https://tomesphere.com/paper/PMC12894358/full.md

---
Source: https://tomesphere.com/paper/PMC12894358