TL;DR
Mono-Hydra++ is a real-time monocular RGB-IMU system that constructs 3D indoor scene graphs, enabling semantic understanding for lightweight robots without active depth sensors.
Contribution
It introduces a novel multi-task deep model and a pipeline for real-time semantic mapping and scene graph construction using only monocular RGB and IMU data.
Findings
Achieves 1.6% lower trajectory error than the RGB-D baseline on ScanNet
Improves average absolute trajectory error (ATE) by 29.8% over calibrated baselines on 7-Scenes
Runs at 25.53 FPS on a Jetson Orin NX with the embedded perception model
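For readers unfamiliar with the metric behind the first two findings, the following is a minimal sketch of how an ATE figure and a relative improvement are commonly computed. It assumes the estimated and ground-truth trajectories are already time-associated and expressed in the same frame; the alignment step that full evaluations usually apply is omitted. The function names `ate_rmse` and `relative_improvement` are hypothetical and are not taken from the paper or its code.

```python
import math


def ate_rmse(estimated, ground_truth):
    """Root-mean-square position error between two trajectories.

    Assumption: both inputs are equal-length sequences of (x, y, z)
    positions that are already time-associated and aligned.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and the same length")
    # Squared Euclidean distance for each pair of corresponding poses.
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est, gt))
        for est, gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_improvement(baseline_error, new_error):
    """Percentage reduction of new_error relative to baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Illustrative values only, not data from the paper.
est = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 3.0)]
error = ate_rmse(est, gt)
improvement = relative_improvement(2.0, 1.5)
```

Under this definition, a reported improvement of 29.8% would mean the new error is about 0.702 times the baseline error, provided the paper computes the percentage this way.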
Abstract
Autonomous agile robots need more than metric geometry: they must understand objects, rooms, places, and spatial relations for search, inspection, exploration, and human-robot interaction. Conventional metric maps support localization and collision avoidance, but do not provide this semantic and relational structure. 3D scene graphs address this gap by connecting geometry with object-level and room-level understanding. Building such representations on agile platforms remains difficult because aerial and lightweight robots operate under strict payload, power, and compute limits, making RGB-D cameras and LiDAR sensors impractical for many onboard settings. We present Mono-Hydra++, a real-time monocular RGB plus IMU pipeline for indoor metric-semantic mapping and hierarchical 3D scene graph construction. The system combines M2H-MX, a DINOv3-based multi-task model for depth and semantics,…
