MOD-UV: Learning Mobile Object Detectors from Unlabeled Videos
Yihong Sun, Bharath Hariharan

TL;DR
MOD-UV is a novel unsupervised mobile object detection method that learns from unlabeled videos by leveraging motion cues, achieving state-of-the-art results without external data or supervision.
Contribution
It introduces a new training paradigm that progressively discovers small and static-but-mobile objects, enhancing unsupervised detection from unlabeled videos.
Findings
Achieves state-of-the-art unsupervised detection on Waymo, nuScenes, and KITTI datasets.
Effectively detects and segments mobile objects from a single static image.
Does not require external data or supervised models.
Abstract
Embodied agents must detect and localize objects of interest, e.g. traffic participants for self-driving cars. Supervision in the form of bounding boxes for this task is extremely expensive. As such, prior work has looked at unsupervised instance detection and segmentation, but in the absence of annotated boxes, it is unclear how pixels must be grouped into objects and which objects are of interest. This results in over-/under-segmentation and irrelevant objects. Inspired by the human visual system and practical applications, we posit that the key missing cue for unsupervised detection is motion: objects of interest are typically mobile objects that frequently move, and their motions can specify separate instances. In this paper, we propose MOD-UV, a Mobile Object Detector learned from Unlabeled Videos only. We begin with instance pseudo-labels derived from motion segmentation, but introduce…
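The abstract's starting point, instance pseudo-labels derived from motion segmentation, can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes a binary per-pixel motion mask is already available, and it swaps in plain 4-connected component labeling as a stand-in for whatever motion segmentation the paper uses. The function name boxes_from_motion_mask, the min_area parameter, and the toy mask are hypothetical and exist only for illustration.

```python
from collections import deque


def boxes_from_motion_mask(mask, min_area=2):
    """Turn a binary motion mask into one pseudo-label box per moving blob.

    Assumption: mask is a list of rows of 0/1, where 1 marks a pixel flagged
    as moving. Blobs are 4-connected components; this is a simple stand-in,
    not the paper's motion segmentation.
    Returns boxes as (x_min, y_min, x_max, y_max) in inclusive pixel coordinates.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill over one connected blob.
            queue = deque([(y, x)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            area = 0
            while queue:
                cy, cx = queue.popleft()
                area += 1
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Drop tiny blobs, which are more likely noise than objects.
            if area >= min_area:
                boxes.append((x0, y0, x1, y1))
    return boxes


# Toy mask: a 2x2 blob, a 2-pixel vertical blob, and one isolated pixel.
example_mask = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0],
]
print(boxes_from_motion_mask(example_mask))
```

Labels built this way cover only objects that are moving in the given frames, and the area filter discards very small blobs. That is consistent with the listing's note that the method goes on to progressively discover small and static-but-mobile objects.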
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques
