TL;DR
CropVLM is a domain-adapted vision-language model designed for open-set crop analysis, enabling scalable, zero-shot plant phenotyping and detection without extensive species-specific training.
Contribution
We introduce CropVLM with domain-specific semantic alignment and HOS-Net for open-set crop detection, advancing scalable phenotyping in agriculture.
Findings
CropVLM achieves 72.51% zero-shot classification accuracy, outperforming baselines.
It reaches 49.17 AP50 in zero-shot detection on CVTCropDet.
It outperforms existing methods on tropical fruit species detection.
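For readers unfamiliar with the detection metric: AP50 conventionally denotes average precision where a predicted box counts as correct if its intersection over union (IoU) with a ground-truth box is at least 0.5. The sketch below illustrates only that matching criterion. The function names and the (x1, y1, x2, y2) box format are assumptions made for illustration, and the exact evaluation protocol used on CVTCropDet is not described in this summary.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format (assumed)."""
    # Corners of the overlapping rectangle, if any.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp to zero so disjoint boxes give zero intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


def matches_at_iou50(pred_box, gt_box, threshold=0.5):
    """Hypothetical helper: True if the prediction meets the IoU >= 0.5 criterion."""
    return iou(pred_box, gt_box) >= threshold


if __name__ == "__main__":
    gt = (0, 0, 10, 10)
    print(matches_at_iou50((2, 0, 12, 10), gt))  # IoU = 80 / 120, above 0.5
    print(matches_at_iou50((5, 0, 15, 10), gt))  # IoU = 50 / 150, below 0.5
```

A full AP50 computation would additionally rank predictions by confidence, match each ground-truth box at most once, and integrate the resulting precision-recall curve; those steps are omitted here.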
Abstract
High-throughput plant phenotyping, the quantitative measurement of observable plant traits, is critical for modern breeding but remains constrained by a "phenotyping bottleneck," where manual data collection is labor-intensive and prone to observer bias. Conventional closed-set computer vision systems fail to address this challenge, as they require extensive species-specific annotation and lack the flexibility to handle diverse breeding populations. To bridge this gap, we present CropVLM, a Vision-Language Model (VLM) adapted for the agricultural domain via Domain-Specific Semantic Alignment (DSSA). Trained on 52,987 manually selected image-caption pairs covering 37 species in natural field conditions, CropVLM effectively maps agronomic terminology to fine-grained visual features. We further introduce the Hybrid Open-Set Localization Network (HOS-Net), an architecture that integrates…
