Robust Object Detection with Pseudo Labels from VLMs using Per-Object Co-teaching

Uday Bhaskar; Rishabh Bhattacharya; Avinash Patel; Sarthak Khoche; Praveen Anil Kulkarni; Naresh Manwani

arXiv:2511.09955·cs.CV·November 14, 2025

TL;DR

This paper presents a novel per-object co-teaching pipeline that leverages vision-language models to generate pseudo-labels for training real-time, high-performance object detectors in autonomous driving, reducing reliance on manual annotations.

Contribution

The work introduces a per-object co-teaching strategy that filters noisy VLM-generated labels at the object level, improving detection accuracy and robustness.

Findings

01. Outperforms baseline YOLOv5m with 46.61% mAP@0.5 on KITTI

02. Adding 10% ground truth labels boosts mAP@0.5 to 57.97%

03. Achieves real-time detection suitable for autonomous driving

Abstract

Foundation models, especially vision-language models (VLMs), offer compelling zero-shot object detection for applications like autonomous driving, a domain where manual labelling is prohibitively expensive. However, their detection latency and tendency to hallucinate predictions render them unsuitable for direct deployment. This work introduces a novel pipeline that addresses this challenge by leveraging VLMs to automatically generate pseudo-labels for training efficient, real-time object detectors. Our key innovation is a per-object co-teaching-based training strategy that mitigates the inherent noise in VLM-generated labels. The proposed per-object co-teaching approach filters noisy bounding boxes from training instead of filtering the entire image. Specifically, two YOLO models learn collaboratively, filtering out unreliable boxes from each mini-batch based on their peers' per-object…
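The abstract is cut off before it states the exact filtering criterion, so the following is only a minimal sketch of how per-object filtering between two peer models could look. It assumes the small-loss selection rule from standard co-teaching: each model trains only on the boxes for which its peer reports the lowest per-object loss. The function names, the `forget_rate` parameter, and the loss values are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def small_loss_indices(losses: Sequence[float], forget_rate: float) -> List[int]:
    """Indices of the boxes kept after dropping the highest-loss fraction.

    Assumption: a box with a large loss is more likely to be a noisy
    pseudo-label, as in standard small-loss co-teaching.
    """
    if not 0.0 <= forget_rate < 1.0:
        raise ValueError("forget_rate must be in [0, 1)")
    if len(losses) == 0:
        return []
    # Keep at least one box so the mini-batch never becomes empty.
    num_keep = max(1, int(round(len(losses) * (1.0 - forget_rate))))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:num_keep])


def per_object_co_teaching(
    losses_a: Sequence[float],
    losses_b: Sequence[float],
    forget_rate: float,
) -> Tuple[List[int], List[int]]:
    """Select, per box, what each model trains on using its peer's losses.

    losses_a[i] and losses_b[i] are the per-object losses that model A and
    model B assign to the same pseudo-labelled box i in the mini-batch.
    """
    if len(losses_a) != len(losses_b):
        raise ValueError("both models must score the same set of boxes")
    keep_for_a = small_loss_indices(losses_b, forget_rate)  # B filters for A
    keep_for_b = small_loss_indices(losses_a, forget_rate)  # A filters for B
    return keep_for_a, keep_for_b


# Hypothetical per-object losses for four pseudo-labelled boxes.
losses_a = [0.2, 3.1, 0.4, 0.9]
losses_b = [0.3, 0.5, 2.8, 0.7]
keep_for_a, keep_for_b = per_object_co_teaching(losses_a, losses_b, forget_rate=0.25)
print(keep_for_a, keep_for_b)
```

Because selection happens per box rather than per image, an image with one unreliable pseudo-label still contributes its remaining boxes to training, which is the contrast with whole-image filtering that the abstract draws.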

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning