MonoSOWA: Scalable monocular 3D Object detector Without human Annotations
Jan Skvrna, Lukas Neumann

TL;DR
MonoSOWA introduces a scalable, annotation-free monocular 3D object detection method that leverages a novel motion model, enabling effective training without human labels and outperforming prior approaches.
Contribution
It presents a new training approach for 3D detection from monocular images without human annotations, using a Local Object Motion Model and dataset aggregation techniques.
Findings
Outperforms previous methods on three datasets without human labels.
Approximately 700 times faster than prior work.
Effective as a pre-training tool for supervised learning.
Abstract
Inferring object 3D position and orientation from a single RGB camera is a foundational task in computer vision with many important applications. Traditionally, 3D object detection methods are trained in a fully-supervised setup, requiring LiDAR and vast amounts of human annotations, which are laborious, costly, and do not scale well with the ever-increasing amounts of data being captured. We present a novel method to train a 3D object detector from a single RGB camera without domain-specific human annotations, making orders of magnitude more data available for training. The method uses a newly proposed Local Object Motion Model to disentangle the source of object movement between subsequent frames, is approximately 700 times faster than previous work, and compensates for camera focal length differences to aggregate multiple datasets. The method is evaluated on three public datasets, where…
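To illustrate why focal length matters when aggregating datasets, here is a minimal sketch based on the standard pinhole camera model. It is not the paper's formulation. Under that model an object of height H at depth Z projects to roughly f * H / Z pixels, so the same pixel size corresponds to different depths under different focal lengths. One common way to reconcile this is to rescale depth to a shared reference focal length. The function names and the reference focal length below are hypothetical, chosen only for illustration.

```python
def projected_height_px(height_m: float, depth_m: float, focal_px: float) -> float:
    """Pinhole projection: pixel height of an object of given metric height and depth."""
    return focal_px * height_m / depth_m


def to_reference_depth(depth_m: float, focal_px: float, ref_focal_px: float) -> float:
    """Depth at which the object would have the same pixel size under the reference camera.

    Assumption: a simple pinhole model with no lens distortion.
    """
    return depth_m * ref_focal_px / focal_px


def from_reference_depth(ref_depth_m: float, focal_px: float, ref_focal_px: float) -> float:
    """Inverse of to_reference_depth: recover metric depth for the actual camera."""
    return ref_depth_m * focal_px / ref_focal_px


# A 1.5 m tall car at 20 m, seen by a camera with a 700 px focal length.
actual_px = projected_height_px(1.5, 20.0, 700.0)

# The same appearance under a hypothetical 1400 px reference camera
# corresponds to a larger depth.
ref_depth = to_reference_depth(20.0, 700.0, 1400.0)
reference_px = projected_height_px(1.5, ref_depth, 1400.0)

# Converting back recovers the original metric depth.
recovered = from_reference_depth(ref_depth, 700.0, 1400.0)
```

In this sketch, a detector trained on reference depths sees a consistent relationship between pixel size and depth across cameras, and predictions are mapped back to metric depth using the focal length of the camera that captured the image.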
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
