MonoSOWA: Scalable monocular 3D Object detector Without human Annotations

Jan Skvrna; Lukas Neumann

arXiv:2501.09481·cs.CV·June 23, 2025

Open Access

TL;DR

MonoSOWA introduces a scalable, annotation-free monocular 3D object detection method that leverages a novel motion model, enabling effective training without human labels and outperforming prior approaches.

Contribution

It presents a new training approach for 3D detection from monocular images without human annotations, using a Local Object Motion Model and dataset aggregation techniques.

Findings

01 Outperforms previous methods on three datasets without human labels.

02 Approximately 700 times faster than prior work.

03 Effective as a pre-training tool for supervised learning.

Abstract

Inferring the 3D position and orientation of objects from a single RGB camera is a foundational task in computer vision with many important applications. Traditionally, 3D object detection methods are trained in a fully supervised setup, requiring LiDAR and vast amounts of human annotations, which are laborious, costly, and do not scale well with the ever-increasing amounts of data being captured. We present a novel method to train a 3D object detector from a single RGB camera without domain-specific human annotations, making orders of magnitude more data available for training. The method uses a newly proposed Local Object Motion Model to disentangle the source of object movement between subsequent frames, is approximately 700 times faster than previous work, and compensates for camera focal length differences to aggregate multiple datasets. The method is evaluated on three public datasets, where…
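To illustrate why focal length matters when aggregating datasets, here is a minimal sketch based on the pinhole camera model, where an object of height H at depth Z projects to f·H/Z pixels. It is not taken from the paper, whose exact compensation scheme is not given in this abstract. The function names and the canonical focal length of 1000 px are hypothetical choices made for illustration only.

```python
import math

# Assumed value for illustration; not a parameter from the paper.
CANONICAL_FOCAL_PX = 1000.0


def projected_height_px(height_m: float, depth_m: float, focal_px: float) -> float:
    """Pinhole projection: apparent height in pixels of an object at a given depth."""
    return focal_px * height_m / depth_m


def to_canonical_depth(depth_m: float, focal_px: float) -> float:
    """Depth at which a canonical-focal camera would see the object at the same pixel size."""
    return depth_m * CANONICAL_FOCAL_PX / focal_px


def from_canonical_depth(canonical_depth_m: float, focal_px: float) -> float:
    """Invert to_canonical_depth to recover metric depth for a specific camera."""
    return canonical_depth_m * focal_px / CANONICAL_FOCAL_PX


# Two cameras see the same 1.5 m tall car at the same pixel height,
# yet the true depths differ because the focal lengths differ.
h_a = projected_height_px(1.5, depth_m=30.0, focal_px=720.0)
h_b = projected_height_px(1.5, depth_m=60.0, focal_px=1440.0)

# In the canonical space both observations map to the same depth target.
z_a = to_canonical_depth(30.0, focal_px=720.0)
z_b = to_canonical_depth(60.0, focal_px=1440.0)
```

Under these assumptions, identical image appearance maps to identical depth targets regardless of the source camera, which is one plausible way to train a single detector on data from several datasets; metric depth is recovered at inference time with `from_canonical_depth`.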

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection