Robust Object Detection with Pseudo Labels from VLMs using Per-Object Co-teaching
Uday Bhaskar, Rishabh Bhattacharya, Avinash Patel, Sarthak Khoche, Praveen Anil Kulkarni, Naresh Manwani

TL;DR
This paper presents a novel per-object co-teaching pipeline that leverages vision-language models to generate pseudo-labels for training real-time, high-performance object detectors in autonomous driving, reducing reliance on manual annotations.
Contribution
The work introduces a per-object co-teaching strategy that filters noisy VLM-generated labels at the object level, improving detection accuracy and robustness.
Findings
Outperforms baseline YOLOv5m with 46.61% mAP@0.5 on KITTI
Adding 10% ground truth labels boosts mAP@0.5 to 57.97%
Achieves real-time detection suitable for autonomous driving
Abstract
Foundation models, especially vision-language models (VLMs), offer compelling zero-shot object detection for applications like autonomous driving, a domain where manual labelling is prohibitively expensive. However, their detection latency and tendency to hallucinate predictions render them unsuitable for direct deployment. This work introduces a novel pipeline that addresses this challenge by leveraging VLMs to automatically generate pseudo-labels for training efficient, real-time object detectors. Our key innovation is a per-object co-teaching-based training strategy that mitigates the inherent noise in VLM-generated labels. The proposed per-object co-teaching approach filters noisy bounding boxes from training instead of filtering the entire image. Specifically, two YOLO models learn collaboratively, filtering out unreliable boxes from each mini-batch based on their peers' per-object…
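The abstract is cut off before it names the per-object criterion, so the following is only a minimal sketch of how object-level co-teaching selection could look, not the authors' implementation. It assumes the small-loss rule from standard co-teaching: each model ranks the pseudo-labelled boxes in a mini-batch by its own per-object loss, and the lowest-loss fraction is passed to its peer for the parameter update. The function names, the `keep_ratio` parameter, and the loss values are all hypothetical.

```python
from typing import List, Tuple


def small_loss_indices(losses: List[float], keep_ratio: float) -> List[int]:
    """Return the indices of the lowest-loss objects (assumed selection rule)."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not losses:
        return []
    # Keep at least one object so a batch is never emptied entirely.
    n_keep = max(1, int(keep_ratio * len(losses)))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:n_keep])


def co_teaching_select(
    losses_a: List[float], losses_b: List[float], keep_ratio: float
) -> Tuple[List[int], List[int]]:
    """Cross-select objects: each model picks the boxes its peer trains on.

    losses_a[i] and losses_b[i] are the per-object losses that models A and B
    assign to the same pseudo-labelled box i in the mini-batch.
    """
    keep_for_b = small_loss_indices(losses_a, keep_ratio)  # A selects for B
    keep_for_a = small_loss_indices(losses_b, keep_ratio)  # B selects for A
    return keep_for_a, keep_for_b


# Hypothetical per-object losses for four pseudo-labelled boxes.
losses_a = [0.2, 1.5, 0.3, 0.9]
losses_b = [0.25, 0.4, 1.8, 0.35]
keep_for_a, keep_for_b = co_teaching_select(losses_a, losses_b, keep_ratio=0.5)
print("Model A trains on boxes:", keep_for_a)
print("Model B trains on boxes:", keep_for_b)
```

Because selection happens per box rather than per image, an image with one hallucinated box can still contribute its remaining boxes to training, which is the distinction the abstract draws against filtering the entire image.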
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
