Estimating Metric Poses of Dynamic Objects Using Monocular Visual-Inertial Fusion
Kejie Qiu, Tong Qin, Hongwen Xie, Shaojie Shen

TL;DR
This paper introduces a monocular visual-inertial system that estimates the metric 6-DoF pose of dynamic objects without prior scale knowledge by leveraging IMU data and trajectory optimization, enabling accurate 3D tracking.
Contribution
The paper presents a method to recover the metric scale of dynamic objects using monocular visual-inertial fusion, without requiring a fixed multi-camera rig or a depth sensor.
Findings
Achieves accurate 6-DoF pose estimation of dynamic objects.
Demonstrates tracking accuracy by comparison against ground truth.
Enables real-time augmented reality applications.
Abstract
A monocular 3D object tracking system generally has only up-to-scale pose estimation results without any prior knowledge of the tracked object. In this paper, we propose a novel idea to recover the metric scale of an arbitrary dynamic object by optimizing the trajectory of the object in the world frame, without motion assumptions. By introducing an additional constraint in the time domain, our monocular visual-inertial tracking system can obtain continuous six degree of freedom (6-DoF) pose estimation without scale ambiguity. Our method requires neither fixed multi-camera nor depth sensor settings for scale observability; instead, the IMU inside the monocular sensing suite provides scale information for both the camera itself and the tracked object. We build the proposed system on top of our monocular visual-inertial system (VINS) to obtain accurate state estimation of the monocular camera…
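The scale ambiguity described in the abstract can be illustrated with a small toy model. The sketch below is not the paper's estimator: it swaps in a simple smoothness cost, which assumes the object accelerates much less than the camera, whereas the abstract states that the actual method makes no motion assumptions. It only shows why a metric camera trajectory (such as one from a visual-inertial estimator) makes the object's scale observable: the world-frame object position is p_wo = p_wc + s * R_wc * t_hat, and a wrong s leaks the camera's own motion into the object's trajectory. All function names, the yaw-only rotation, and the synthetic data are assumptions made for illustration.

```python
import math

def rotate_yaw(yaw, v):
    """Rotate 3-vector v by `yaw` radians about the z-axis."""
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1], v[2])

def second_differences(points):
    """Discrete acceleration p[k+1] - 2*p[k] + p[k-1] at each interior sample."""
    return [
        tuple(points[k + 1][i] - 2 * points[k][i] + points[k - 1][i] for i in range(3))
        for k in range(1, len(points) - 1)
    ]

def estimate_scale(cam_positions, cam_yaws, rel_positions):
    """Closed-form least-squares scale s minimizing
    sum_k || D2(p_wc + s * R_wc * t_hat) ||^2  (toy smoothness cost, an assumption).

    cam_positions: metric camera positions in the world frame.
    cam_yaws:      camera yaw angles (rotation simplified to yaw only).
    rel_positions: up-to-scale object positions in the camera frame.
    """
    b = [rotate_yaw(y, t) for y, t in zip(cam_yaws, rel_positions)]
    d2a = second_differences(cam_positions)
    d2b = second_differences(b)
    num = sum(x * y for u, v in zip(d2a, d2b) for x, y in zip(u, v))
    den = sum(x * x for v in d2b for x in v)
    if den == 0.0:
        raise ValueError("scale is unobservable for this motion")
    return -num / den

def simulate(true_scale, n=40):
    """Synthetic data: accelerating camera, object moving at constant velocity."""
    cam_positions, cam_yaws, rel_positions = [], [], []
    for k in range(n):
        p_wc = (math.sin(0.3 * k), 0.5 * math.cos(0.2 * k), 0.001 * k * k)
        yaw = 0.05 * k
        p_wo = (1.0 + 0.1 * k, 2.0, 0.5)
        # Object position in the camera frame, then stripped of its metric scale.
        rel = rotate_yaw(-yaw, tuple(p_wo[i] - p_wc[i] for i in range(3)))
        cam_positions.append(p_wc)
        cam_yaws.append(yaw)
        rel_positions.append(tuple(r / true_scale for r in rel))
    return cam_positions, cam_yaws, rel_positions

if __name__ == "__main__":
    cams, yaws, rels = simulate(true_scale=2.5)
    print(estimate_scale(cams, yaws, rels))
```

In this toy setup the recovered scale matches the simulated one up to floating-point error, because the simulated object has zero acceleration. If the camera itself does not accelerate, the denominator vanishes and the scale cannot be recovered, which mirrors the need for sufficient camera excitation in visual-inertial scale estimation generally.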
Taxonomy
Topics: Robotics and Sensor-Based Localization · Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques
