Track Any Peppers: Weakly Supervised Sweet Pepper Tracking Using VLMs
Jia Syuen Lim, Yadan Luo, Zhi Chen, Tianqi Wei, Scott Chapman, and Zi Huang

TL;DR
This paper introduces TAP, a weakly supervised method that uses foundation models to detect and track sweet peppers in video with minimal manual labeling, reaching a HOTA score of 80.4%.
Contribution
We propose TAP, a novel ensemble approach combining vision-language models and traditional tracking algorithms for efficient agricultural object tracking.
Findings
Achieved a HOTA score of 80.4% in sweet pepper tracking
Reduced manual labeling through pseudo-label generation
Enhanced detection accuracy with relighting and depth filtering
Abstract
For the Detection and Multi-Object Tracking of Sweet Peppers Challenge, we present Track Any Peppers (TAP), a weakly supervised ensemble technique for sweet pepper tracking. TAP leverages the zero-shot detection capabilities of vision-language foundation models such as Grounding DINO to automatically generate pseudo-labels for sweet peppers in video sequences with minimal human intervention. These pseudo-labels, refined when necessary, are used to train a YOLOv8 segmentation network. To improve detection accuracy under challenging conditions, we apply relighting adjustments as pre-processing and depth-based filtering after inference. For object tracking, we integrate the Matching by Segment Anything (MASA) adapter with the BoT-SORT algorithm. Our approach achieves a HOTA score of 80.4%, MOTA of 66.1%, Recall of 74.0%, and Precision of 90.7%, demonstrating…
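The abstract mentions depth-based filtering after inference but does not say how it is done. The sketch below is one plausible reading, not the authors' implementation: it assumes each detection is a pixel-coordinate box with a score, that a per-pixel depth map in metres is available, and that boxes whose median depth exceeds a cutoff (for example, peppers in a background row) are discarded. The function name `filter_by_depth`, the `max_depth_m` parameter, and the toy data are all hypothetical.

```python
from statistics import median


def filter_by_depth(detections, depth_map, max_depth_m):
    """Keep detections whose median in-box depth is at most max_depth_m.

    detections: list of (x1, y1, x2, y2, score) boxes in pixel coordinates.
    depth_map: 2D list of per-pixel depths in metres; 0 marks a missing reading.
    Hypothetical helper for illustration only.
    """
    kept = []
    for x1, y1, x2, y2, score in detections:
        # Collect the valid depth readings that fall inside the box.
        values = [
            d
            for row in depth_map[y1:y2]
            for d in row[x1:x2]
            if d > 0
        ]
        # With no valid depth the box cannot be judged, so it is kept.
        if not values or median(values) <= max_depth_m:
            kept.append((x1, y1, x2, y2, score))
    return kept


# Toy 4x4 depth map: left half is near (0.5 m), right half is far (3.0 m).
depth_map = [[0.5, 0.5, 3.0, 3.0] for _ in range(4)]
detections = [(0, 0, 2, 4, 0.9), (2, 0, 4, 4, 0.8)]
kept = filter_by_depth(detections, depth_map, max_depth_m=1.5)
```

Using the median rather than the mean keeps a few stray background pixels inside a box from pushing a foreground detection over the cutoff. The 1.5 m threshold here is arbitrary and would need tuning for a real camera setup.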
Taxonomy
Topics: Plant Pathogens and Fungal Diseases · Advanced Chemical Sensor Technologies · Insect Pheromone Research and Control
Methods: Attention Is All You Need · Softmax · Linear Layer · Dense Connections · Layer Normalization · Multi-Head Attention · Residual Connection · Vision Transformer · You Only Look Once · Adapter
