Co-training for Deep Object Detection: Comparing Single-modal and Multi-modal Approaches

Jose L. Gómez; Gabriel Villalonga; Antonio M. López

arXiv:2104.11619·cs.CV·May 5, 2021

TL;DR

This paper explores co-training for deep object detection, comparing single-modal (appearance only) with multi-modal (appearance plus estimated depth) co-training, and reports that the multi-modal variant has advantages, especially under domain shift.

Contribution

The paper introduces a multi-modal co-training framework for self-labeling object bounding boxes and compares it against single-modal co-training in several domain shift scenarios.

Findings

01

Multi-modal co-training outperforms single-modal in standard SSL settings.

02

The multi-modal approach is more robust under domain shift, especially with GAN-based translation.

03

Using off-the-shelf depth estimation models suffices without retraining on translated images.

Abstract

Top-performing computer vision models are powered by convolutional neural networks (CNNs). Training an accurate CNN highly depends on both the raw sensor data and their associated ground truth (GT). Collecting such GT is usually done through human labeling, which is time-consuming and does not scale as we wish. This data labeling bottleneck may be intensified due to domain shifts among image sensors, which could force per-sensor data labeling. In this paper, we focus on the use of co-training, a semi-supervised learning (SSL) method, for obtaining self-labeled object bounding boxes (BBs), i.e., the GT to train deep object detectors. In particular, we assess the goodness of multi-modal co-training by relying on two different views of an image, namely, appearance (RGB) and estimated depth (D). Moreover, we compare appearance-based single-modal co-training with multi-modal. Our results…
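The co-training idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the two views are reduced to single numeric features standing in for RGB and depth, the "detectors" are toy nearest-centroid classifiers, and the names `fit_centroid`, `co_train`, `rounds`, and `threshold` are all hypothetical. It only shows the general pattern in which each view's model self-labels unlabeled samples it is confident about and hands them to the other view's training set.

```python
from typing import Callable, List, Tuple

Sample = Tuple[float, float]  # (view A feature, view B feature); toy stand-ins for RGB and depth
Predictor = Callable[[float], Tuple[int, float]]  # feature -> (label, confidence)


def fit_centroid(data: List[Tuple[float, int]]) -> Predictor:
    """Toy 1-D nearest-centroid classifier; a placeholder for a real detector."""
    zeros = [f for f, y in data if y == 0]
    ones = [f for f, y in data if y == 1]
    m0, m1 = sum(zeros) / len(zeros), sum(ones) / len(ones)

    def predict(x: float) -> Tuple[int, float]:
        d0, d1 = abs(x - m0), abs(x - m1)
        label = 0 if d0 <= d1 else 1
        total = d0 + d1
        # Confidence is the relative distance to the rejected centroid (0.5 to 1.0).
        conf = max(d0, d1) / total if total > 0 else 0.5
        return label, conf

    return predict


def co_train(
    labeled: List[Sample],
    labels: List[int],
    unlabeled: List[Sample],
    rounds: int = 3,
    threshold: float = 0.8,
) -> Tuple[Predictor, Predictor, List[Sample]]:
    # Each view keeps its own training set, seeded with the human-labeled data.
    train_a = [(s[0], y) for s, y in zip(labeled, labels)]
    train_b = [(s[1], y) for s, y in zip(labeled, labels)]
    pool = list(unlabeled)

    for _ in range(rounds):
        model_a, model_b = fit_centroid(train_a), fit_centroid(train_b)
        remaining = []
        for s in pool:
            label_a, conf_a = model_a(s[0])
            label_b, conf_b = model_b(s[1])
            used = False
            if conf_a >= threshold:  # view A is confident: it teaches view B
                train_b.append((s[1], label_a))
                used = True
            if conf_b >= threshold:  # view B is confident: it teaches view A
                train_a.append((s[0], label_b))
                used = True
            if not used:
                remaining.append(s)
        pool = remaining
        if not pool:
            break

    # Refit both models on the final, self-labeled training sets.
    return fit_centroid(train_a), fit_centroid(train_b), pool


if __name__ == "__main__":
    seed = [(0.0, 0.1), (1.0, 0.9)]  # one labeled example per class
    extra = [(0.05, 0.4), (0.6, 0.95), (0.95, 0.9), (0.1, 0.05)]
    model_a, model_b, leftover = co_train(seed, [0, 1], extra)
    print(model_a(0.7), model_b(0.1), leftover)
```

In a real detection setting the exchanged items would be predicted bounding boxes rather than class labels, and the confidence test would be whatever selection rule the detector supports; the sketch keeps only the exchange of self-labeled samples between two views.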
