AI-Based Multi-Object Relative State Estimation with Self-Calibration   Capabilities

Thomas Jantos; Christian Brommer; Eren Allak; Stephan Weiss; Jan; Steinbrener

arXiv:2303.00371·cs.RO·October 10, 2024

AI-Based Multi-Object Relative State Estimation with Self-Calibration Capabilities

Thomas Jantos, Christian Brommer, Eren Allak, Stephan Weiss, Jan, Steinbrener

PDF

Open Access 1 Repo

TL;DR

This paper presents a novel AI-based multi-object relative state estimation method that fuses visual and inertial data, enabling accurate, self-calibrating localization for mobile robots in complex environments.

Contribution

It introduces a combined AI pose estimator with sensor fusion that is capable of self-calibration for multi-object relative state estimation in real-world scenarios.

Findings

01

Reliable multi-object relative state estimation demonstrated in experiments.

02

Self-calibration improves robustness and reproducibility.

03

Effective fusion of AI-based visual poses with IMU data.

Abstract

The capability to extract task specific, semantic information from raw sensory data is a crucial requirement for many applications of mobile robotics. Autonomous inspection of critical infrastructure with Unmanned Aerial Vehicles (UAVs), for example, requires precise navigation relative to the structure that is to be inspected. Recently, Artificial Intelligence (AI)-based methods have been shown to excel at extracting semantic information such as 6 degree-of-freedom (6-DoF) poses of objects from images. In this paper, we propose a method combining a state-of-the-art AI-based pose estimator for objects in camera images with data from an inertial measurement unit (IMU) for 6-DoF multi-object relative state estimation of a mobile robot. The AI-based pose estimator detects multiple objects of interest in camera images along with their relative poses. These measurements are fused with IMU…

Tables1

Table 1. TABLE I: RMSE for position and orientation across the whole trajectory for different sequences.

Sequence	#objects	(x, y, z) [m]	(roll, pitch, yaw) [deg]
1	3	[0.066, 0.163, 0.034]	[3.00, 6.81, 4.52]
2	3	[0.162, 0.358, 0.233]	[54.38, 19.39, 20.27]
3	2	[0.163, 0.338, 0.136]	[25.34, 16.57, 15.13]
4	4	[0.045, 0.157, 0.108]	[15.38, 5.58, 10.02]
5	1	[0.075, 0.031, 0.054]	[61.12, 9.66, 26.19]
6	3	[0.061, 0.117, 0.029]	[11.00, 5.77, 7.41]
7	4	[0.093, 0.159, 0.099]	[19.38, 12.38, 27.24]
8	3	[0.127, 0.165, 0.104]	[26.19, 11.95, 10.70]
mean	-	[0.099, 0.186, 0.099]	[26.97, 11.0, 15.19]
$\pm$ std	-	$\pm$ [0.043, 0.102, 0.062]	$\pm$ [19.18, 4.75, 8.01]

Equations35

X = [

X = [

p_{I C}^{T}, q_{I C}^{T}, p_{O_{0} W}^{T}, q_{O_{0} W}^{T}, \dots, p_{O_{N} W}^{T}, q_{O_{N} W}^{T}]^{T}

\dot{p}_{W I}

\dot{p}_{W I}

\dot{v}_{W I}

\dot{q}_{W I}

\tilde{z}_{p_{O_{k}}}

\tilde{z}_{p_{O_{k}}}

= p_{O_{k} C} - (p_{O_{k} W} + R_{O_{k} W} (p_{W I} + R_{W I} p_{I C}))

\tilde{z}_{R_{O_{k}}}

\tilde{z}_{q_{O_{k}}}

\tilde{z}_{O_{k}}

H_{p, p_{W I}}

H_{p, p_{W I}}

H_{p, R_{W I}}

H_{p, p_{I C}}

H_{p, p_{O_{k} W}}

H_{p, R_{O_{k} W}}

H_{R, R_{W I}}

H_{R, R_{I C}}

H_{R, R_{O_{k} W}}

H_{p, O_{k}} = [

H_{p, O_{k}} = [

H_{p, p_{I C}} H_{p, R_{I C}} H_{p, p_{O_{0} W}} H_{p, R_{O_{0} W}}

\dots H_{p, p_{O_{N} W}} H_{p, R_{O_{N} W}}]

H_{R, O_{k}} = [

H_{R, O_{k}} = [

H_{R, p_{I C}} H_{R, R_{I C}} H_{R, p_{O_{0} W}} H_{R, R_{O_{0} W}}

\dots H_{R, p_{O_{N} W}} H_{R, R_{O_{N} W}}]

H_{O_{k}} = [H_{p, O_{k}} H_{R, O_{k}}]

H_{O_{k}} = [H_{p, O_{k}} H_{R, O_{k}}]

R_{O_{k} W}

R_{O_{k} W}

p_{O_{k} W}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

aau-cns/poet
pytorch

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsRobotics and Sensor-Based Localization · Advanced Vision and Imaging · Image and Object Detection Techniques

Full text

AI-Based Multi-Object Relative State Estimation with

Self-Calibration Capabilities

Thomas Jantos1, Christian Brommer1, Eren Allak1, Stephan Weiss1 and Jan Steinbrener1 1The authors are with the Control of Networked Systems Group, University of Klagenfurt, 9020 Klagenfurt am Wörthersee, Austria [email protected] work was supported by the Federal Ministry for Climate Action, Environment, Energy, Mobility, Innovation and Technology (BMK) under the grant agreement 881082 (MUKISANO).Pre-print version, accepted Jan/2023, DOI follows ASAP ©IEEE.

Abstract

The capability to extract task specific, semantic information from raw sensory data is a crucial requirement for many applications of mobile robotics. Autonomous inspection of critical infrastructure with Unmanned Aerial Vehicles (UAVs), for example, requires precise navigation relative to the structure that is to be inspected. Recently, Artificial Intelligence (AI)-based methods have been shown to excel at extracting semantic information such as 6 degree-of-freedom (6-DoF) poses of objects from images.

In this paper, we propose a method combining a state-of-the-art AI-based pose estimator for objects in camera images with data from an inertial measurement unit (IMU) for 6-DoF multi-object relative state estimation of a mobile robot. The AI-based pose estimator detects multiple objects of interest in camera images along with their relative poses. These measurements are fused with IMU data in a state-of-the-art sensor fusion framework. We illustrate the feasibility of our proposed method with real world experiments for different trajectories and number of arbitrarily placed objects. We show that the results can be reliably reproduced due to the self-calibrating capabilities of our approach.

I INTRODUCTION

Mobile robots, such as unmanned aerial vehicles (UAVs), rely on the information of their on-board sensors to autonomously navigate the world. Semantic information, i.e. the higher-level meaning of sensor data, can improve a robot’s ability to navigate in its surroundings and allows for more complicated tasks [1]. In semantic navigation, the robot moves depending on context or task, in many cases with respect to objects of interest in the scene. Such tasks include infrastructure inspection [2] or object tracking [3]. While the goal for the latter is to keep the moving object in the field of view of the camera, infrastructure inspection requires accurate positioning of the robot with respect to a typically static object of interest. Semantic information extracted from the robot’s sensor data, namely the detection of the object of interest and its pose relative to the robot are important elements to achieving this task. For example, monitoring power pole insulators for possible damages requires a UAV to fly around the desired insulator and take high resolution images from specific positions to allow for detection of damage or changes over time.

Current autonomous mission execution is typically based on global navigation satellite system (GNSS) for localization of the UAV. GNSS always provides a global position and not a relative position with respect to an object of interest. Moreover, the accuracy is too low for precise, centimeter-range navigation and GNSS is prone to signal loss in proximity to large structures. In this case often other sensor modalities are considered, e.g. visual-inertial odometry (VIO) [4]. With VIO, a local pose can be estimated by combining the movement of geometrical features (edges, corners) in monocular camera images with data from an inertial measurement unit (IMU). Classical, feature extraction based algorithms are not well suited for semantic navigation as they rely on raw features that do not provide information about any objects in the scene and they struggle with fast or slow motion [5].

Recent advancements in artificial intelligence (AI) led to a breakthrough in the extraction of semantic information from raw sensor measurements like path detection using camera images [6], object recognition with laser scanners [7], semantic segmentation for scene understanding [8], and recently, 6 degree-of-freedom (6-DoF) pose estimation of objects for robotic grasping [9]. Furthermore, the availability of AI capable edge computing devices enables the usage of such methods on mobile robots.

In this paper, we investigate the suitability of AI-based pose estimation for full 6-DoF, object relative state estimation for mobile robotics. We consider a minimal sensor configuration consisting of a single monocular camera and an IMU in line with size, weight and computational power constraints of mobile robotic platforms such as UAVs. We utilize an AI-based pose estimator to detect, classify and estimate the 6-DoF poses of objects of interest contained in each camera image, and then fuse the information with IMU measurements in a state-of-the-art sensor fusion framework to infer the 6-DoF object-relative pose of the robot. A schematic overview of our approach is presented in Fig. 1. Our main contributions can be summarized as follows:

•

Extracting semantic information from images with AI and fusing this relative pose information with IMU data for accurate, 6-DoF, object relative state estimation.

•

Formulating a filter-based method to estimate the state of the mobile robot and the pose of multiple, different objects based on 6-DoF relative pose measurements.

•

Providing a self-calibrating formulation of the filter that does not require any assumptions about the global or relative positions of the different objects in a scene.

•

Validating the proposed approach with several real world experiments using objects of a popular 6-DoF object pose challenge data set to show that our method works for different trajectories and different number of objects with reproducible performance.

The remainder of the paper is organized as follows. In Section II, we summarize the related work. In Section III, we present how we integrate object relative pose measurements into a state estimation framework. In Section IV, the experiments and the corresponding results are discussed. Finally, the paper is concluded in Section V.

II RELATED WORK

For state estimation in mobile robotics, typically IMU data and one or more pose sensors such as GNSS are fused together. GNSS provides global position information but not 3D orientation information. In the absence of GNSS signals, VIO, the combination of a monocular camera and IMU data, can estimate the pose of a robot by triangulating the position of the camera given geometrical features from an image and estimating the remaining scale factor with inertial data [4, 10]. Fusing multiple sensors yields a more robust and reliable estimate of the robot’s state. There exist mainly two different approaches for sensor fusion: filter-based, recursive and optimization-based methods. The latter can yield more accurate state estimates but is computationally more demanding due to optimizing across several sensor measurements [11]. In comparison, filtering-based methods, such as Extended Kalman Filters (EKF) [12, 13], are computationally more efficient and thus are well-suited for mobile robotics.

GNSS and classical VIO do not provide object relative pose measurements and thus are not suitable for object relative state estimation. However, image-based 6-DoF relative object pose estimation methods can be utilized as a pose sensor in state estimation frameworks. There exist classical approaches and AI methods based on deep learning. Classical approaches are either template-based, where the object pose is determined by finding a matching template for the current image [14], or feature-based, where keypoints are extracted from the image and then matched to the 3D object model [15]. On the other hand, deep learning-based approaches are mostly end-to-end learned methods, where the 6-DoF pose is directly estimated from the input image using convolutional neural networks (CNNs). Deep learning-based methods can be divided even further depending on the amount of additional information used. [9, 16] take a single RGB image as input to their network and employ symmetry aware losses during training that make use of 3D object model information. 3D object models can also be provided as an additional input to the network [17], utilized for refining an initial pose estimate [18, 19, 20] or for matching keypoints, which were regressed by the network [21]. Other forms of additional information consist of taking multiple images [22, 23] or depth maps [24, 25]. Recently, we have proposed PoET [26] a 6-DoF multi-object pose estimation framework that achieves state-of-the-art results on benchmark datasets and only takes a single RGB image as input and does not require any additional information during training or inference.

An alternative to object relative state estimation is simultaneous localization and mapping (SLAM) on an object level. [27] uses depth images to extract 6-Dof object pose information. Similar to us, [28] use an AI-based pose estimator to predict 6-DoF relative object poses from images. Both approaches fuse the 6-DoF relative pose information from multiple view points together and combine it with graph optimization to estimate the pose of the camera and objects with respect to a map. However, they do not use any other sensors, such as IMU, in their approach. [29] combines IMU measurements, geometric features from images, and 6-DoF object poses in a SLAM approach. Graph optimization still needs to be performed for a graph containing all object poses. In general, the requirement for a map and optimization in SLAM results in a higher computational load for the mobile robot. Our approach still allows for object relative state estimation without this requirement.

Object relative state estimation for mobile robotics has been shown in [30], where a UAV localizes itself with respect to cylinder shaped infrastructure by extracting geometrical features from images and assuming a known radius. Similarly, [31] used a color-based ellipse-detection algorithm to first detect the object of interest in the image and then used the knowledge about object size, visual appearance and camera parameters to calculate the relative pose to the object. Meanwhile, [32] investigated different classical 6-DoF object pose estimation approaches for object-relative state estimation. They also investigated the use of machine learning to detect the presence of objects by training simple classifiers on classical features extracted from images.

In contrast to that, we propose here a fully AI-based method to extract semantic information from camera images. We do not need to define a geometric model, keypoints or templates to map object appearances in images to relative 6-DoF poses. Moreover, AI-based models are not limited to specific geometric object shapes and remove the need for handcrafted features. In our previous work [26], we have introduced PoET for 6-DoF pose estimation of objects in RGB images using state-of-the-art AI methods. We mainly focused on the definition, the training, a thorough ablation study and comparison to other deep learning-based methods on benchmark datasets for 6-DoF multi-object pose estimation. In this work, we present a detailed investigation of the suitability of our AI-based object pose estimator as pose sensor for 6-DoF object-relative state estimation of a mobile robot using a state-of-the-art sensor fusion framework with multiple real world experiments.

III METHOD

In this section, we present the design of our approach. First, we explain the notation used for the measurement equations and transformations of coordinate frames. Second, we reason about the choice of frameworks for 6-DoF pose estimation and state estimation. Finally, we describe how the estimated 6-DoF of several known objects can be combined to estimate the 6-DoF pose of the robot. This includes a detailed description of how our choice of sensor fusion algorithm is extended to include 6-DoF pose measurements of each individual object.

III-A Notation

Throughout this paper we use the following notation: given three coordinate frames $A$ , $B$ and $C$ , the transformation ${}_{{\scriptscriptstyle A}}{\mathbf{T}}_{\scriptscriptstyle BC}$ defines frame $C$ with respect to frame $B$ expressed in frame $A$ . If the left subscript $A$ is omitted, the transformation is defined in frame $B$ . Furthermore, the transformation ${}_{{\scriptscriptstyle A}}{\mathbf{T}}_{\scriptscriptstyle AB}$ can be split up into two parts namely ${}_{{\scriptscriptstyle A}}{\mathbf{p}}_{\scriptscriptstyle AB}$ and ${\mathbf{R}}_{\scriptscriptstyle AB}$ , which describe the translation and rotation respectively. Alternatively, the rotation can also be expressed by a quaternion ${\mathbf{q}}_{\scriptscriptstyle AB}$ . Each quaternion $\mathbf{q}$ can be represented by $\mathbf{q}=[\mathbf{q_{v}}~{}q_{w}]^{T}=[q_{x}~{}q_{y}~{}q_{z}~{}q_{w}]^{T}$ . The quaternion multiplication is represented by $\otimes$ . $\mathbf{I}_{3}$ and $\mathbf{0}_{3}$ refer to the identity and the null matrix in $\mathbb{R}^{3x3}$ , respectively. $[\omega]_{\times}$ is the skew-symmetric operator as defined in [33].

III-B Pose and State Estimation Frameworks

Mobile robots, in particular UAVs, are subject to payload constraints, which impose not only limitations on the size and amount of sensors a robot can carry, but also on the computational power available for data processing. Hence, the necessity arises for efficient and computationally light algorithms. Therefore, we chose our object pose estimation framework PoET [26] as a 6-DoF pose sensor as it only uses RGB images and does not rely on any depth information and 3D object models, removing the need for additional hardware components and reducing the computational load by not having to process 3D models. In a first step, PoET detects all objects it was trained to detect in an image and also predicts their classes. Afterwards, the predicted bounding boxes and multi-scale feature maps are fed to a transformer architecture to predict the relative, up-to scale 6-DoF pose between the camera and each object. The predicted rotation and translation are unique for non-symmetric objects. For objects with one or more symmetry axes, the rotation or translation for some object poses becomes ambiguous with more than one possible solution. The obvious negative effects of this ambiguity on the pose estimation of the robot can be minimized by considering multiple objects in heterogeneous configuration and fusing individual measurements in a proper sensor fusion framework. This mimics an inspection workflow where typically several distinct parts of interest of the structure to be inspected are visible at the same time.

For the sensor fusion framework, we use MaRS [12] for multi-sensor fusion and state estimation due to being lightweight and computationally efficient as it was developed specifically with mobile robotics in mind. MaRS was designed for modularity and separates the propagation of the core state variables based on inertial data from the state updates based on the measurements of the individual sensors. It also uses abstract sensor classes that are type agnostic. This allows for straightforward integration of new sensor modules. For our method, we define a multi-pose sensor, where a single measurement consists of a single RGB image. From each image, we then extract the 6-DoF relative poses of all detected objects with PoET and use them for the EKF update step as described below.

III-C EKF State and Update

As reference frame for the mobile robotic system, we chose the frame of its IMU ( $I$ ). Thus, our goal is to estimate the pose of the IMU ( $I$ ) with respect to the world ( $W$ ) by measuring the 6-DoF relative poses between the camera ( $C$ ) and a set of objects ( $O_{k})$ . The different coordinate frames are visualized in Fig. 1. As mentioned earlier, the relative poses of the objects with respect to the camera are extracted from the image by our AI-based pose estimator dubbed PoET. PoET will only consider objects that it was trained for. Given a single RGB image, a 6-DoF pose ${\mathbf{T}}_{\scriptscriptstyle CO_{k}}$ is predicted for each detected object of interest and the assignment of the predicted pose to an object is based on the predicted class. For details about the architectural choices in PoET, we refer the reader to [26]. Depending on the total number of objects $N$ in a scene, the full state vector $\mathbf{X}$ is then defined as:

[TABLE]

The core states necessary for state propagation are the position ${\mathbf{p}}_{\scriptscriptstyle WI}$ of the IMU, its velocity ${\mathbf{v}}_{\scriptscriptstyle WI}$ and its orientation ${\mathbf{q}}_{\scriptscriptstyle WI}$ as well as the gyroscopic bias ${\mathbf{b}}_{\scriptscriptstyle\omega}$ and the accelerometer bias ${\mathbf{b}}_{\scriptscriptstyle a}$ . The pose and velocity dynamics are given as [4]

[TABLE]

where ${\mathbf{a}}_{\scriptscriptstyle m}$ is the measured acceleration in the IMU frame, ${\mathbf{n}}_{\scriptscriptstyle a}$ is the accelerometer noise parameter, $\mathbf{g}$ is the gravity vector in $W$ , ${\mathbf{\omega}}_{\scriptscriptstyle b}$ is the measured angular velocity in the IMU frame, ${\mathbf{n}}_{\scriptscriptstyle\omega}$ is the gyroscopic noise parameter, and $\Omega(\omega)$ is the quaternion multiplication matrix of $\omega$ . The IMU biases are modeled as random walks.

Furthermore, we estimate the calibration between the IMU and the camera given by ${\mathbf{p}}_{\scriptscriptstyle IC}$ and ${\mathbf{q}}_{\scriptscriptstyle IC}$ . Due to implementation reasons we assume the number of objects in a scene to be known a priori, but neither the global poses of the objects nor the relative poses between objects is known. Therefore, we additionally estimate an object-world which describes the transformation ( ${\mathbf{p}}_{\scriptscriptstyle O_{k}W},{\mathbf{q}}_{\scriptscriptstyle O_{k}W}$ ) between the object frame and the navigation world. Both the camera-IMU extrinsics ${\mathbf{p}}_{\scriptscriptstyle IC},{\mathbf{q}}_{\scriptscriptstyle IC}$ as well as the object poses ${\mathbf{p}}_{\scriptscriptstyle O_{k}W},{\mathbf{q}}_{\scriptscriptstyle O_{k}W}$ are modeled to remain constant over time.

For each image, every measured relative pose is treated as an independent measurement with which the pose of the camera can be estimated. To calculate the required Jacobians for the update step, we consider the inverted relative pose measurements ${\mathbf{T}}_{\scriptscriptstyle O_{k}C}$ , i.e. the camera pose relative to the object frame. Based on the relationship between the different coordinate frames and the independent relative position $z_{\mathbf{p}_{O_{k}}}$ and orientation $z_{\mathbf{q}_{O_{k}}}$ measurements, the residuals for position $\tilde{z}_{\mathbf{p}_{O_{k}}}$ and orientation $\tilde{z}_{\mathbf{R}_{O_{k}}}$ can be written as:

[TABLE]

Given these residuals and depending on a single pose measurement from object $O_{k}$ , the Jacobian for the position $H_{\mathbf{p}}$ and orientation $H_{\mathbf{R}}$ with respect to the states is [33]:

[TABLE]

where, e.g. $H_{\mathbf{p},{\mathbf{p}}_{\scriptscriptstyle WI}}$ only considers the part of the residual $\tilde{z}_{\mathbf{p}_{O_{k}}}$ that depends on the state ${\mathbf{p}}_{\scriptscriptstyle WI}$ . The rest of the Jacobians are equal to $\textbf{0}_{3}$ . As relative pose measurements for different objects are independent of each other, the Jacobians for the other ( $i\neq n$ ) object-world states, i.e. $H_{\mathbf{p},{\mathbf{p}}_{\scriptscriptstyle O_{i}W}}$ , $H_{\mathbf{p},{\mathbf{R}}_{\scriptscriptstyle O_{i}W}}$ , $H_{\mathbf{R},{\mathbf{p}}_{\scriptscriptstyle O_{i}W}}$ , $H_{\mathbf{R},{\mathbf{R}}_{\scriptscriptstyle O_{i}W}}$ , are all equal to $\mathbf{0}_{3}$ . For a single object $O_{k}$ , the Jacobian is given by stacking the individual components:

[TABLE]

Depending on the current image, the final residual $z$ and observation matrix $\mathbf{H}$ for the state update is determined by vertically stacking the residuals and Jacobians, individually, from Eq. 8 and Eq. 19, respectively, for each object that was detected for the current update step. Similar to hardware sensors, our AI-based pose sensor might return faulty or inaccurate measurements. In a similar fashion as described in [34], we conduct a $\chi^{2}$ test based on the EKF innovation $\mathbf{S}$ and the residual to detect outlier measurements. This is applied for the measurement of each object individually. Outlier measurements for object $O_{k}$ are then rejected and the final residual and Jacobian have to be rebuild by masking the corresponding rows. The correction is then calculated based on this final total residual and associated Jacobian. In the update step, measurement uncertainties for each measurement can be considered. For the present work, these uncertainties have been fixed to 10 cm and 20 degrees for the position and orientation measurement of each object, respectively. These numbers were determined based on the standard deviation of PoET across a video sequence reported in [26].

The proper initialization of the individual frames is an important aspect to consider. At the beginning of the recording, we initialize an arbitrary but fixed navigation world $W$ . Without loss of generality, the IMU frame is initialized at the origin of $W$ . The initial extrinsic calibration between the IMU and camera is determined through visual-inertial calibration [35]. Each object-world is initialized when the corresponding object is seen by the camera for the first time. The object frame is then placed with respect to the world frame by taking the currently estimated pose of the camera and the relative pose measurement:

[TABLE]

In our problem formulation, the robot’s pose $I$ is estimated relative to a set of object-worlds $O_{k}$ through relative pose measurements. As the robot’s pose is relative to a world frame and the measurements are relative to the corresponding object frames, the object-worlds can be placed freely with respect to the world frame, which does cause observability issues. By fixing the state of one object-world reference frame, the system is rendered observable. This fixed object, from now on called the main object $O_{m}$ , serves as the anchor point for the object-relative 6-DoF state estimation. The position of the main object’s frame with respect to the navigation world is not changed, i.e. the corresponding Jacobians $H_{\mathbf{p},{\mathbf{p}}_{\scriptscriptstyle O_{m}W}}$ and $H_{\mathbf{R},{\mathbf{p}}_{\scriptscriptstyle O_{m}W}}$ are set to $\mathbf{0}_{3}$ . Measurements in which the main object is not visible in the picture are directly rejected. Otherwise, estimating the object-world of additional objects while the anchor is not visible leads to ambiguous updates.

The propagation step is performed at the rate of the IMU sensor readings, while the update step happens with the frequency of the camera images.

IV EXPERIMENTS & RESULTS

In this section, we present the experiments conducted and discuss the results in detail. We trained PoET on the YCB-V dataset [9] as described in [26], a benchmark dataset for 6-DoF pose estimation, and chose a subset of objects to serve as objects of interest in our experiments. The images and IMU data were recorded using an Intel Realsense D435i with an RGB resolution of 1280x720 and 30 FPS. After undistorting the images, they were cropped to a resolution of 640x480, which is the standard resolution of the YCB-V dataset. We record our own real data by placing the objects in our motion capture room and moving around the objects with the camera while tracking the body of the camera. This mimics the inspection of a set of objects of interest with a mobile robotic platform with 6-DoFs. An example object configuration and image can be found in Fig. 2. Because we do not record any information regarding the global position of the objects in the room, the trajectory derived from the object-relative state estimation has to be aligned with the ground truth trajectory to calculate the error metrics. It is important to note the differences between the benchmark data and our own real data. The camera which was used to record the YCB-V dataset has a different recording resolution, field of view, and focal point than the camera used during our experiments. In addition, some real world objects had slightly different appearance (size and coloring) than the ones used for the data set that PoET was trained with. The results reported here have been obtained with the YCB-V trained model of PoET. We did not perform any retraining or fine-tuning of the model to adapt to these differences.

We conducted two different experiments. First, we investigated the performance of our approach for 8 different sequences. The sequences varied with respect to the number of objects present, the constellation of the objects and the trajectory performed by the camera around the objects. We calculate the root mean square error (RMSE) for the position and Euler angles by comparing the estimated trajectory with the respective ground truth trajectory for a single recording of that sequence. Moreover, we calculate the average RMSE and the standard deviation (std) over all trajectories. The results are summarized in Table III. The overall performance across all sequences shows that our method is able to sufficiently estimate the state given AI-based, relative 6-DoF pose measurements for a variety of scenarios. For most sequences, the position RMSE is around 10 cm or less, which is an acceptable error for this rather complex task. Furthermore, these results indicate that our method is applicable for object relative state estimation. In some cases, i.e. sequence 2 and 3, the achieved performance is worse than in others. This is due to outliers in the predictions of the 6-DoF pose estimator. Objects being partially out of the image or ambiguous viewpoints as well as motion blur in the images lead to wrong pose estimates. Especially the latter one results in the predictions for all objects in a single image to be wrong. While the $\chi^{2}$ test helps to reject such frames, multiple consecutive images with noisy or wrong measurements will still affect the state estimation as it can result in phases with no updates and only IMU propagation leading to deadreckoning. In such cases, a prediction of uncertainties for each object and measurement would lead to better results rather than working with the fixed values described above. Integration of aleatoric and epistemic uncertainties for the predictions of PoET is subject to future work.

To illustrate the reproducibility of our approach, we chose two out of the 8 sequences (sequence 4 and sequence 6) and repeated each sequence 10 times resulting in similar but not exactly the same trajectories. For each sequence, the RMSE across the whole trajectory for each run and the average RMSE and std across all runs are summarized in Table III and Table III, respectively. For both sequences, we are able to reproduce the performance of our method across 10 independent runs with a low standard deviation. This shows an AI-based component can be reliably incorporated into the state estimation of a robot.

Moreover, we compare the estimated and the ground truth position and orientation across the whole trajectory for an example recording in Fig. 3. This example shows that our approach reliably estimates the position and orientation for the whole duration of the recording. Furthermore, the graphs show that the raw measurements of a single object sometimes lead to a reprojected IMU pose that does not align with the ground truth trajectory. However, by fusing IMU information with pose measurements from multiple objects our method is able to reliably estimate the trajectory, despite outlier measurements. In addition to that, we show in Fig. 4 an example for the self-calibration capabilities of our approach with respect to the object-world states. The object-world is wrongly initialized after it was first observed due to a possible noisy measurement. Nonetheless, the state converges after 5 seconds.

V CONCLUSIONS

In this paper, we investigated object relative state estimation for mobile robots with an AI-based method to extract semantic information (object class and pose) from single RGB images. We defined a minimal sensor configuration, consisting of an RGB camera and IMU, and an experimental scenario in which object relative state estimation is required, mimicking the task of inspection and monitoring. We derived and implemented a filter-based solution for full state estimation of a mobile robot given 6-DoF relative pose measurements. Additionally, our method does not require any initial information about the global and relative poses of the objects. By defining object-world states, the coordinate frame of each object is estimated concurrently with respect to a common navigation world by using one of the objects as an anchor point. Our experiments with own real data showed that our method can be used for state estimation of the mobile robot in different scenarios and that the results can be reliably reproduced. Our results show that AI-based, semantic information from a single sensor is sufficient in combination with IMU data for accurate state estimation. Future work will consider incorporating aleatoric and epistemic uncertainties of the AI-based predictions in the sensor fusion framework for improved outlier rejection as well the integration of our proposed approach on a real UAV for closed loop experiments.

Bibliography35

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] J. Crespo, J. C. Castillo, O. M. Mozos, and R. Barber, “Semantic information for robot navigation: A survey,” Applied Sciences , vol. 10, no. 2, p. 497, 2020.
2[2] S. Jordan, J. Moore, S. Hovet, J. Box, J. Perry, K. Kirsche, D. Lewis, and Z. T. H. Tse, “State-of-the-art technologies for uav inspections,” IET Radar, Sonar & Navigation , vol. 12, no. 2, pp. 151–164, 2018.
3[3] Y. Li, C. Fu, Z. Huang, Y. Zhang, and J. Pan, “Keyfilter-aware real-time uav object tracking,” in 2020 IEEE International Conference on Robotics and Automation (ICRA) . IEEE, 2020, pp. 193–199.
4[4] S. Weiss, D. Scaramuzza, and R. Siegwart, “Monocular-slam–based navigation for autonomous micro helicopters in gps-denied environments,” Journal of Field Robotics , vol. 28, no. 6, pp. 854–874, 2011.
5[5] E. Allak, A. Hardt-Stremayr, and S. Weiss, “Key-frame strategy during fast image-scale changes and zero motion in VIO without persistent features,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) . IEEE, oct 2018.
6[6] A. Giusti, J. Guzzi, D. C. Cireşan, F.-L. He, J. P. Rodríguez, F. Fontana, M. Faessler, C. Forster, J. Schmidhuber, G. Di Caro, et al. , “A machine learning approach to visual perception of forest trails for mobile robots,” IEEE Robotics and Automation Letters , vol. 1, no. 2, pp. 661–667, 2015.
7[7] R. Dominguez, E. Onieva, J. Alonso, J. Villagra, and C. Gonzalez, “Lidar based perception solution for autonomous vehicles,” in 2011 11th International Conference on Intelligent Systems Design and Applications . IEEE, 2011, pp. 790–795.
8[8] M. Hofmarcher, T. Unterthiner, J. Arjona-Medina, G. Klambauer, S. Hochreiter, and B. Nessler, “Visual scene understanding for autonomous driving using semantic segmentation,” in Explainable AI: Interpreting, Explaining and Visualizing Deep Learning . Springer, 2019, pp. 285–296.