Objects as Spatio-Temporal 2.5D points
Paridhi Singh, Gaurav Singh, Arun Kumar

TL;DR
This paper introduces a lightweight, weakly supervised method for estimating 3D object positions in bird's eye view by jointly learning 2D detections and scene depth, without requiring 3D annotations.
Contribution
It extends a single-shot detector to model objects as spatio-temporal BEV points using only 2D supervision and LiDAR during training, eliminating the need for 3D annotations.
Findings
Achieves accuracy comparable to state-of-the-art methods on the KITTI benchmark.
Is over 10x more computationally efficient than recent approaches.
Effectively models object tracks as BEV points without 3D annotations.
Abstract
Determining accurate bird's eye view (BEV) positions of objects and tracks in a scene is vital for various perception tasks, including object interaction mapping and scenario extraction; however, the level of supervision required to accomplish that is extremely challenging to procure. We propose a lightweight, weakly supervised method to estimate the 3D position of objects by jointly learning to regress 2D object detections and the scene's depth in a single feed-forward pass of a network. Our proposed method extends a center-point-based single-shot object detector and introduces a novel object representation in which each object is modeled as a spatio-temporal BEV point, without the need for any 3D or BEV annotations during training or for LiDAR data at query time. The approach leverages readily available 2D object supervision along with LiDAR point clouds (used only during training)…
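To make the idea of turning a 2D detection plus a depth estimate into a BEV point concrete, here is a minimal sketch using standard pinhole back-projection. It is not the authors' implementation: the function names pixel_to_bev and track_to_bev, the intrinsics fx and cx, and the sample values are all hypothetical, and the sketch assumes the depth at an object's center pixel is already available in meters.

```python
# Illustrative sketch only: standard pinhole back-projection, not the paper's code.
# Assumption: fx (focal length in pixels) and cx (principal point x) are known
# camera intrinsics, and depth_m is the predicted depth at the object's center pixel.

def pixel_to_bev(u, depth_m, fx, cx):
    """Map an object's center column u (pixels) and depth (meters) to a BEV point.

    Returns (x, z) in camera coordinates: x is lateral offset, z is forward distance.
    """
    x = (u - cx) * depth_m / fx
    z = depth_m
    return (x, z)


def track_to_bev(centers, fx, cx):
    """Convert a per-frame list of (u, depth_m) pairs into a list of BEV points."""
    return [pixel_to_bev(u, d, fx, cx) for (u, d) in centers]


if __name__ == "__main__":
    # Hypothetical intrinsics and detections, chosen only for illustration.
    fx, cx = 700.0, 600.0
    track = [(740.0, 10.0), (600.0, 8.0)]
    print(track_to_bev(track, fx, cx))
```

Stacking the per-frame points for one object gives a track in the BEV plane, which is one simple way to read the phrase "spatio-temporal BEV point"; how the paper associates detections across frames is not specified in this summary.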
Taxonomy
Topics: Robotics and Sensor-Based Localization · Advanced Image and Video Retrieval Techniques · Marine animal studies overview
