Learn from Foundation Model: Fruit Detection Model without Manual Annotation
Yanan Wang, Zhenghao Fei, Ruichen Li, Yibin Ying

TL;DR
This paper introduces SDM-D, a framework that leverages foundation models and knowledge distillation to train effective fruit detection models without manual annotations, achieving near-supervised performance.
Contribution
The paper presents a novel framework combining foundation models and knowledge distillation to train domain-specific fruit detection models without manual labels.
Findings
SDM-D nearly matches the performance of models trained with manual labels.
SDM-D outperforms open-set detection methods like Grounding SAM and YOLO-World.
The paper introduces the MegaFruits dataset, which contains over 25,000 images.
Abstract
Recent breakthroughs in large foundation models have enabled the possibility of transferring knowledge pre-trained on vast datasets to domains with limited data availability. Agriculture is one of the domains that lacks sufficient data. This study proposes a framework to train effective, domain-specific, small models from foundation models without manual annotation. Our approach begins with SDM (Segmentation-Description-Matching), a stage that leverages two foundation models: SAM2 (Segment Anything in Images and Videos) for segmentation and OpenCLIP (Open Contrastive Language-Image Pretraining) for zero-shot open-vocabulary classification. In the second stage, a novel knowledge distillation mechanism is utilized to distill compact, edge-deployable models from SDM, enhancing both inference speed and perception accuracy. The complete method, termed SDM-D…
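The matching step of SDM can be illustrated with a minimal sketch. It assumes each segmented region and each text description has already been encoded into a shared embedding space (in the paper this role is played by OpenCLIP, applied to regions produced by SAM2). The function names, the temperature value, the example descriptions, and the toy three-dimensional vectors below are all hypothetical and are not taken from the paper; the sketch only shows zero-shot classification by cosine similarity followed by a softmax.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def match_regions(region_embs, text_embs, labels, temperature=0.01):
    """Assign each region the description with the highest similarity.

    temperature is an assumed scaling factor for the softmax, not a
    value reported in the paper.
    """
    results = []
    for emb in region_embs:
        sims = [cosine(emb, t) for t in text_embs]
        top = max(sims)
        # Subtract the maximum before exponentiating for numerical stability.
        exps = [math.exp((s - top) / temperature) for s in sims]
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(labels)), key=probs.__getitem__)
        results.append((labels[best], probs[best]))
    return results


# Hypothetical descriptions and toy embeddings standing in for encoder outputs.
labels = ["ripe strawberry", "unripe strawberry", "leaf"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
region_embs = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.95]]

pseudo_labels = match_regions(region_embs, text_embs, labels)
```

Each entry of `pseudo_labels` pairs a region with its best-matching description and a confidence score. Labels of this kind, attached to the segmentation masks, are what a compact student model could then be trained on in the distillation stage.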
Taxonomy
Topics: Smart Agriculture and AI · Vehicle License Plate Recognition · Soil and Land Suitability Analysis
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Segment Anything Model · Knowledge Distillation
