MOLTR: Multiple Object Localisation, Tracking, and Reconstruction from Monocular RGB Videos
Kejie Li, Hamid Rezatofighi, Ian Reid

TL;DR
MOLTR is a monocular RGB video-based system that localizes, tracks, and reconstructs multiple objects with semantic understanding, enabling object-centric mapping for robotics and AR/VR.
Contribution
It introduces a novel online method for object-centric mapping using monocular videos, combining 3D detection, shape embedding, and Bayesian filtering.
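Bayesian filtering of an object's position can be pictured as a Kalman update that fuses each new 3D detection into the object's current estimate. The sketch below is only an illustration under stated assumptions, not MOLTR's actual filter. It assumes objects are static, the monocular detector returns a 3D centre with Gaussian noise, and the measurement model is the identity. The class name `ObjectTrack` and all noise values are hypothetical.

```python
import numpy as np

class ObjectTrack:
    """Hypothetical static-object track: a 3D centre with Gaussian uncertainty."""

    def __init__(self, centre, cov):
        self.x = np.asarray(centre, dtype=float)  # estimated 3D centre (world frame)
        self.P = np.asarray(cov, dtype=float)     # 3x3 covariance of the estimate

    def predict(self, process_noise=1e-4):
        # Assumption: objects are static, so the mean is unchanged and only
        # the uncertainty grows by a small (hypothetical) process noise.
        self.P = self.P + process_noise * np.eye(3)

    def update(self, z, R):
        # Standard Kalman update with H = I: the detector is assumed to
        # observe the 3D centre directly, with measurement covariance R.
        z = np.asarray(z, dtype=float)
        S = self.P + R                       # innovation covariance
        K = np.linalg.solve(S, self.P).T     # Kalman gain P S^-1 (P, S symmetric)
        self.x = self.x + K @ (z - self.x)   # pull the estimate towards the detection
        self.P = (np.eye(3) - K) @ self.P    # uncertainty shrinks after fusion

# Usage: one prediction step followed by one detection update.
track = ObjectTrack(centre=[0.0, 0.0, 2.0], cov=0.5 * np.eye(3))
track.predict()
track.update(z=[0.2, 0.0, 2.1], R=0.5 * np.eye(3))
```

With equal prior and measurement uncertainty, the fused centre lands roughly halfway between the prior estimate and the detection, and the covariance roughly halves. This is the behaviour that lets repeated observations tighten an object's location over a video.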
Findings
Superior performance on indoor and outdoor datasets
Effective object localization and tracking
Progressive shape refinement
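One simple way to picture progressive shape refinement is to fuse the per-frame shape codes of an object into a single code that improves as more views arrive. The sketch below uses a confidence-weighted running mean. That fusion rule, the class name `ShapeCodeFuser`, and the `confidence` parameter are assumptions made for illustration and may differ from how MOLTR actually refines shapes. Decoding the fused code into a mesh would require the learned shape decoder, which is not shown.

```python
import numpy as np

class ShapeCodeFuser:
    """Hypothetical fusion of per-frame shape codes into one object-level code."""

    def __init__(self, dim):
        self.code = np.zeros(dim)  # current fused shape code
        self.weight = 0.0          # accumulated observation weight

    def add(self, code, confidence=1.0):
        # Confidence-weighted incremental mean: each new observation moves the
        # fused code towards it by confidence / (total weight so far).
        if confidence <= 0:
            raise ValueError("confidence must be positive")
        code = np.asarray(code, dtype=float)
        self.weight += confidence
        self.code += (confidence / self.weight) * (code - self.code)
        return self.code

# Usage: two equally confident observations of the same object.
fuser = ShapeCodeFuser(dim=4)
fuser.add([1.0, 0.0, 0.0, 0.0])
fuser.add([0.0, 1.0, 0.0, 0.0])
```

The incremental form avoids storing every past code. Each update costs the same regardless of how many frames have been seen, which suits an online system.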
Abstract
Semantic-aware reconstruction offers advantages over geometry-only reconstruction for future robotic and AR/VR applications because it represents not only where things are, but also what they are. Object-centric mapping is the task of building an object-level reconstruction in which objects are separate, meaningful entities that convey both geometric and semantic information. In this paper, we present MOLTR, a solution to object-centric mapping that uses only monocular image sequences and camera poses. It localises, tracks, and reconstructs multiple objects online as an RGB camera captures a video of its surroundings. Given a new RGB frame, MOLTR first applies a monocular 3D detector to localise objects of interest and extract their shape codes, which represent object shapes in a learned embedding space. Detections are then merged with existing objects in the map…
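The abstract breaks off before explaining how detections are merged with existing objects. For that reason, the sketch below does not describe MOLTR's own procedure. It shows a common tracking-by-detection scheme instead: build a cost from 3D centre distance plus shape-code distance, solve a minimum-cost assignment with the Hungarian algorithm (SciPy's `linear_sum_assignment`), and reject pairs above a gate. The function name `associate`, the weighting `shape_weight`, and the gate `max_cost` are all hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(det_centres, det_codes, obj_centres, obj_codes,
              shape_weight=0.5, max_cost=1.0):
    """Hypothetical detection-to-object matching.

    Cost = Euclidean distance between 3D centres
           + shape_weight * Euclidean distance between shape codes.
    Returns (matches, unmatched_detections); unmatched detections would
    start new objects in the map.
    """
    if len(det_centres) == 0:
        return [], []
    if len(obj_centres) == 0:
        return [], list(range(len(det_centres)))
    det_centres = np.asarray(det_centres, dtype=float)
    det_codes = np.asarray(det_codes, dtype=float)
    obj_centres = np.asarray(obj_centres, dtype=float)
    obj_codes = np.asarray(obj_codes, dtype=float)
    # Pairwise costs via broadcasting, shape (num_detections, num_objects).
    pos_cost = np.linalg.norm(det_centres[:, None, :] - obj_centres[None, :, :], axis=-1)
    shape_cost = np.linalg.norm(det_codes[:, None, :] - obj_codes[None, :, :], axis=-1)
    cost = pos_cost + shape_weight * shape_cost
    rows, cols = linear_sum_assignment(cost)  # minimum-cost one-to-one assignment
    # Keep only assignments whose cost passes the gate.
    matches = [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
    matched = {r for r, _ in matches}
    unmatched = [i for i in range(len(det_centres)) if i not in matched]
    return matches, unmatched

# Usage: one detection near an existing object, one far away.
matches, new_objects = associate(
    det_centres=[[0.0, 0.0, 2.0], [5.0, 0.0, 2.0]],
    det_codes=[[0.1, 0.0], [0.0, 0.0]],
    obj_centres=[[0.1, 0.0, 2.0]],
    obj_codes=[[0.0, 0.0]],
)
```

Here the first detection is merged into the existing object. The second has no object within the gate, so it would initialise a new map entry. Adding the shape term helps separate nearby objects of different shapes that a position-only cost might confuse.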
Taxonomy
Topics: Robotics and Sensor-Based Localization · 3D Surveying and Cultural Heritage · Advanced Vision and Imaging
Methods: Attentive Walk-Aggregating Graph Neural Network
