Does Your VFM Speak Plant? The Botanical Grammar of Vision Foundation Models for Object Detection
Lars Lundqvist, Earl Ranario, Hamid Kamangir, Heesup Yun, Christine Diepenbrock, Brian N. Bailey, J. Mason Earles

TL;DR
This paper systematically optimizes prompts for vision foundation models to improve zero-shot object detection in agricultural scenes, demonstrating significant performance gains and transferability across models and tasks.
Contribution
It introduces a prompt optimization framework that reveals model-specific prompt structures, significantly enhancing zero-shot detection accuracy in complex agricultural environments.
Findings
Optimized prompts improve detection performance by up to +0.362 mAP@0.5.
Prompt structures optimized on synthetic data transfer effectively to real-world data.
Model-specific prompt optimization yields substantial gains over naive baseline prompts.
Abstract
Vision foundation models (VFMs) offer the promise of zero-shot object detection without task-specific training data, yet their performance in complex agricultural scenes remains highly sensitive to text prompt construction. We present a systematic prompt optimization framework evaluating four open-vocabulary detectors -- YOLO World, SAM3, Grounding DINO, and OWLv2 -- for cowpea flower and pod detection across synthetic and real field imagery. We decompose prompts into eight axes and conduct one-factor-at-a-time analysis followed by combinatorial optimization, revealing that models respond divergently to prompt structure: conditions that optimize one architecture can collapse another. Applying model-specific combinatorial prompts yields substantial gains over a naive species-name baseline, including +0.357 mAP@0.5 for YOLO World and +0.362 mAP@0.5 for OWLv2 on synthetic cowpea flower…
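The two-stage search described in the abstract (one-factor-at-a-time analysis, then combinatorial optimization) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the three axes and their options are hypothetical (this summary does not list the paper's eight axes), and `toy_score` is a stand-in for running a detector with the rendered prompt and computing a detection metric such as mAP. It is not the authors' implementation.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Prompt = Dict[str, str]

# Hypothetical prompt axes and options, for illustration only.
AXES: Dict[str, List[str]] = {
    "color": ["", "yellow", "purple"],
    "noun": ["cowpea flower", "flower", "blossom"],
    "context": ["", "on a plant", "in a field"],
}
AXIS_ORDER = ("color", "noun", "context")


def render(prompt: Prompt) -> str:
    """Join the non-empty axis values into one text prompt."""
    return " ".join(prompt[a] for a in AXIS_ORDER if prompt[a])


def toy_score(prompt: Prompt) -> float:
    """Placeholder metric (word count / 10). A real run would call the
    detector with render(prompt) and return its detection score."""
    return len(render(prompt).split()) / 10


def ofat(baseline: Prompt, axes: Dict[str, List[str]],
         score: Callable[[Prompt], float]) -> Dict[str, List[Tuple[float, str]]]:
    """One-factor-at-a-time: vary a single axis while holding the others
    at their baseline values, and rank the options of each axis."""
    ranked = {}
    for axis, options in axes.items():
        scored = [(score({**baseline, axis: opt}), opt) for opt in options]
        ranked[axis] = sorted(scored, reverse=True)
    return ranked


def combinatorial(ranked: Dict[str, List[Tuple[float, str]]], top_k: int,
                  score: Callable[[Prompt], float]) -> Tuple[float, Prompt]:
    """Score every combination of the top_k options per axis and return
    the best (score, prompt) pair."""
    names = list(ranked)
    shortlists = [[opt for _, opt in ranked[n][:top_k]] for n in names]
    best_score, best_prompt = float("-inf"), {}
    for combo in product(*shortlists):
        candidate = dict(zip(names, combo))
        s = score(candidate)
        if s > best_score:
            best_score, best_prompt = s, candidate
    return best_score, best_prompt


# Naive species-name baseline, then the two-stage search.
baseline = {"color": "", "noun": "cowpea flower", "context": ""}
ranked = ofat(baseline, AXES, toy_score)
best_score, best_prompt = combinatorial(ranked, top_k=2, score=toy_score)
print(render(best_prompt), best_score)
```

Because the abstract reports that models respond divergently to prompt structure, a search like this would be run separately per detector, passing a different `score` callable for each model rather than reusing one winning prompt across architectures.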
