Visual-Inertial Mapping with Non-Linear Factor Recovery

Vladyslav Usenko; Nikolaus Demmel; David Schubert; Jörg Stückler; Daniel Cremers

arXiv:1904.06504·cs.CV·June 2, 2020

TL;DR

This paper introduces a novel method for visual-inertial mapping that reconstructs non-linear factors from odometry data, enhancing global consistency and accuracy in environment mapping.

Contribution

It proposes a non-linear factor recovery approach from visual-inertial odometry to improve global mapping robustness and accuracy.

Findings

01. Outperforms state-of-the-art methods on the majority of the evaluated EuRoC sequences.

02. Enhances global map consistency through loop closure integration.

03. Makes the roll and pitch angles of the global map observable through VIO factors.

Tables

Table I: RMS ATE of the estimated trajectory in meters on the EuRoC dataset for several different methods. In the upper part we summarize the results for the VIO methods that run optimization in a local window and estimate the pose of every camera frame. In the lower part we evaluate mapping methods that operate on all keyframes and perform global map optimization. In both evaluations the proposed system shows the lowest error on the majority of the sequences and outperforms the competitors. Note: The V2_03 sequence is excluded from the comparison because it has more than 400 missing frames for one of the cameras.

Sequence	MH_01	MH_02	MH_03	MH_04	MH_05	V1_01	V1_02	V1_03	V2_01	V2_02
VI DSO [26], mono	0.06	0.04	0.12	0.13	0.12	0.06	0.07	0.10	0.04	0.06
OKVIS [13] mono	0.34	0.36	0.30	0.48	0.47	0.12	0.16	0.24	0.12	0.22
OKVIS [13] stereo	0.23	0.15	0.23	0.32	0.36	0.04	0.08	0.13	0.10	0.17
VINS FUSION [20] mono	0.18	0.09	0.17	0.21	0.25	0.06	0.09	0.18	0.06	0.11
VINS FUSION [20] stereo	0.24	0.18	0.23	0.39	0.19	0.10	0.10	0.11	0.12	0.10
IS VIO [9] stereo	0.06	0.06	0.10	0.24	0.19	0.06	0.10	0.26	0.08	0.21
Proposed VIO, stereo	0.07	0.06	0.07	0.13	0.11	0.04	0.05	0.10	0.04	0.05
VI SLAM [12] mono, KF	0.25	0.18	0.21	0.30	0.35	0.11	0.13	0.20	0.12	0.20
VI SLAM [12] stereo, KF	0.11	0.09	0.19	0.27	0.23	0.04	0.05	0.11	0.10	0.18
VI ORB-SLAM [19], mono, KF	0.07	0.08	0.09	0.22	0.08	0.03	0.03	X	0.03	0.04
Pure BA, stereo, KF	0.09	0.08	0.05	0.27	0.16	0.04	0.03	X	0.04	0.04
BA + Identity Factors, stereo, KF	0.08	0.07	X	0.34	0.15	0.04	0.03	0.56	0.05	0.04
Proposed VI Mapping, stereo, KF	0.08	0.06	0.05	0.10	0.08	0.04	0.02	0.03	0.03	0.02

Table II: Mean processing time in milliseconds of the mapping subsystem on EuRoC sequences normalized (divided) by the number of keyframes in the map.

Total	Factor Extraction	Keypoint detection	Matching and Triangulation	Optimization (10 iterations)
52.8	3.6	6.4	23.1	19.7

Equations

$$\mathbf{J}_{\mathbf{r}(\mathbf{s})}=\lim_{\bm{\xi}\rightarrow\mathbf{0}}\frac{\mathbf{r}(\mathbf{s}\oplus\bm{\xi})\ominus\mathbf{r}(\mathbf{s})}{\bm{\xi}}.$$

$$E(\mathbf{s})=\frac{1}{2}\mathbf{r}(\mathbf{s})^{\top}\mathbf{W}\mathbf{r}(\mathbf{s}),$$

$$E(\mathbf{s}\oplus\bm{\xi})=E(\mathbf{s})+\bm{\xi}^{\top}\mathbf{J}_{\mathbf{r}(\mathbf{s})}^{\top}\mathbf{W}\mathbf{r}(\mathbf{s})+\frac{1}{2}\bm{\xi}^{\top}\mathbf{J}_{\mathbf{r}(\mathbf{s})}^{\top}\mathbf{W}\mathbf{J}_{\mathbf{r}(\mathbf{s})}\bm{\xi}.$$

$$\bm{\xi}^{*}=-(\mathbf{J}_{\mathbf{r}(\mathbf{s})}^{\top}\mathbf{W}\mathbf{J}_{\mathbf{r}(\mathbf{s})})^{-1}\mathbf{J}_{\mathbf{r}(\mathbf{s})}^{\top}\mathbf{W}\mathbf{r}(\mathbf{s}).$$

$$r_{i}(\bm{\xi})$$

$$\mathbf{s}=\{\mathbf{s}_{\text{k}},\mathbf{s}_{\text{f}},\mathbf{s}_{\text{l}}\},$$

$$\begin{pmatrix}x&y&z\end{pmatrix}^{\top}$$

$$\mathbf{r}_{it}$$

$$\mathbf{q}_{i}(u,v,d)$$

$$\Delta\mathbf{R}_{t+1}$$

$$\Delta\mathbf{v}_{t+1}$$

$$\Delta\mathbf{p}_{t+1}$$

$$\Delta\mathbf{s}_{t+1}=f(\Delta\mathbf{s}_{t},\mathbf{a}_{t+1},\bm{\omega}_{t+1}),$$

$$\Delta\mathbf{s}_{t+1}=g_{t+1}(\mathbf{b}_{i}^{\text{a}},\mathbf{b}_{i}^{\text{g}}).$$

$$\mathbf{J}_{g_{t+1}}^{\text{a}}$$

$$\mathbf{J}_{g_{t+1}}^{\text{g}}$$

$$\Delta\tilde{\mathbf{s}}(\mathbf{b}_{i}^{\text{a}},\mathbf{b}_{i}^{\text{g}})=\Delta\mathbf{s}(\bar{\mathbf{b}}_{i}^{\text{a}},\bar{\mathbf{b}}_{i}^{\text{g}})\oplus(\mathbf{J}^{\text{a}}\bm{\epsilon}^{\text{a}}+\mathbf{J}^{\text{g}}\bm{\epsilon}^{\text{g}}),$$

$$\mathbf{r}_{\Delta\mathbf{R}}$$

$$\mathbf{r}_{\Delta\mathbf{v}}$$

$$\mathbf{r}_{\Delta\mathbf{p}}$$

$$\bm{\Sigma}_{t+1}=\mathbf{J}_{f}^{\text{s}}\bm{\Sigma}_{t}{\mathbf{J}_{f}^{\text{s}}}^{\top}+\mathbf{J}_{f}^{\text{a}}\bm{\Sigma}^{\text{a}}{\mathbf{J}_{f}^{\text{a}}}^{\top}+\mathbf{J}_{f}^{\text{g}}\bm{\Sigma}^{\text{g}}{\mathbf{J}_{f}^{\text{g}}}^{\top},$$

$$E$$

$$\mathbf{H}_{\alpha\alpha}^{\text{m}}$$

$$\mathbf{b}_{\alpha}^{\text{m}}$$

$$\mathbf{H}=\begin{bmatrix}\mathbf{H}_{\alpha\alpha}&\mathbf{H}_{\alpha\beta}\\ \mathbf{H}_{\beta\alpha}&\mathbf{H}_{\beta\beta}\end{bmatrix},\quad\mathbf{b}=\begin{bmatrix}\mathbf{b}_{\alpha}\\ \mathbf{b}_{\beta}\end{bmatrix}.$$

$$E^{\text{G}}(\mathbf{s})$$

$$D_{\text{KL}}(p(\mathbf{s})\,\|\,p_{\text{a}}(\mathbf{s}))=\frac{1}{2}\left(\langle\mathbf{H}_{\text{a}},\bm{\Sigma}_{\text{o}}\rangle-\log\det(\mathbf{H}_{\text{a}}\bm{\Sigma}_{\text{o}})+\|\mathbf{H}_{\text{a}}^{\frac{1}{2}}(\bm{\mu}_{\text{a}}-\bm{\mu}_{\text{o}})\|^{2}-d\right),$$

$$\mathbf{J}_{\text{r}}=\begin{pmatrix}\vdots\\ \mathbf{J}_{i}\\ \vdots\end{pmatrix},\quad\mathbf{H}_{\text{r}}=\begin{pmatrix}\ddots&&0\\ &\mathbf{H}_{i}&\\ 0&&\ddots\end{pmatrix},$$

$$D_{\text{KL}}(\mathbf{H}_{\text{r}})=\langle\mathbf{J}_{\text{r}}^{\top}\mathbf{H}_{\text{r}}\mathbf{J}_{\text{r}},\bm{\Sigma}_{\text{o}}\rangle-\log\det(\mathbf{J}_{\text{r}}^{\top}\mathbf{H}_{\text{r}}\mathbf{J}_{\text{r}}).$$

$$\mathbf{H}_{i}=(\{\mathbf{J}_{\text{r}}\bm{\Sigma}_{\text{o}}\mathbf{J}_{\text{r}}^{\top}\}_{i})^{-1},$$

$$\mathbf{r}_{\text{rel}}(\mathbf{s},\mathbf{z}_{\text{rel}})$$

$$\mathbf{r}_{\text{rp}}(\mathbf{s},\mathbf{z}_{\text{rp}})$$

$$\mathbf{r}_{\text{pos}}(\mathbf{s},\mathbf{z}_{\text{pos}})$$

$$\mathbf{r}_{\text{yaw}}(\mathbf{s},\mathbf{z}_{\text{yaw}})$$

$$E_{\text{nfr}}^{\text{G}}(\mathbf{s})$$

Full text

Visual-Inertial Mapping with Non-Linear Factor Recovery

Vladyslav Usenko¹, Nikolaus Demmel¹, David Schubert¹, Jörg Stückler² and Daniel Cremers¹

¹ Vladyslav Usenko, Nikolaus Demmel, David Schubert and Daniel Cremers are with the Technical University of Munich, Germany {usenko, demmeln, schubdav, cremers}@in.tum.de

² Jörg Stückler is with MPI for Intelligent Systems Tübingen, Germany [email protected]

Abstract

Cameras and inertial measurement units are complementary sensors for ego-motion estimation and environment mapping. Their combination makes visual-inertial odometry (VIO) systems more accurate and robust. For globally consistent mapping, however, combining visual and inertial information is not straightforward. To estimate the motion and geometry with a set of images large baselines are required. Because of that, most systems operate on keyframes that have large time intervals between each other. Inertial data on the other hand quickly degrades with the duration of the intervals and after several seconds of integration, it typically contains only little useful information.

In this paper, we propose to extract relevant information for visual-inertial mapping from visual-inertial odometry using non-linear factor recovery. We reconstruct a set of non-linear factors that make an optimal approximation of the information on the trajectory accumulated by VIO. To obtain a globally consistent map we combine these factors with loop-closing constraints using bundle adjustment. The VIO factors make the roll and pitch angles of the global map observable, and improve the robustness and the accuracy of the mapping. In experiments on a public benchmark, we demonstrate superior performance of our method over the state-of-the-art approaches.

I Introduction

Visual-inertial odometry (VIO) is a popular approach for tracking the motion of a camera in application domains such as robotics or augmented reality. By combining visual and IMU measurements, one can exploit the complementary strengths of both sensors and thereby increase accuracy and robustness. Commonly, the optimization of camera trajectory and map is performed locally on a small window of recent camera frames and IMU measurements. This approach, however, is inevitably prone to drift in the estimates.

Globally consistent optimization for visual-inertial mapping is less explored in the computer vision community. While in principle the optimization could be formulated as bundle adjustment with additional IMU measurements, this approach would quickly become computationally infeasible due to the high number of frames, which would lead to a large number of optimization parameters in a naive formulation. To keep the computational burden within bounds, bundle adjustment subsamples the high-frame-rate images of the camera to a smaller set of keyframes. The common choice in VIO is to preintegrate IMU measurements between consecutive frames. If we select keyframes temporally far apart to make the optimization efficient, the preintegrated IMU measurements provide little information to constrain the trajectory due to the accumulated sensor noise. The low keyframe rate also degrades the quality of the velocities and biases estimated from visual and inertial cues, which are required for pose prediction using preintegrated IMU measurements.

We propose a novel approach that formulates visual-inertial mapping as bundle adjustment on a high-frame-rate set of visual and inertial measurements. Instead of directly optimizing the camera trajectory for all frames, we propose a hierarchical approach which first recovers a local VIO estimate at the frame rate of the camera. Once keyframes are removed and marginalized from the current local VIO optimization window, we extract non-linear factors [15] that approximate the accumulated visual-inertial information about the camera motion between keyframes. The keyframes and non-linear factors are subsequently used on the global bundle-adjustment layer.

For the VIO layer, our method uses image features designed for fast and accurate tracking, while for the mapping layer we employ distinctive, lighting- and viewpoint-invariant keypoints that are suitable for loop closing. With this, our approach can leverage information from the IMU and short-term visual tracking at high frame rates together with keypoint matching and loop-closing at low frame rates for globally consistent mapping (Fig. 1). The factors also help to keep the map gravity-aligned and to bridge between frames that do not have enough visual information. Our approach also makes the optimization problem smaller, since we do not have to estimate velocities and biases.

In summary, our contributions are:

• We propose a novel two-layered visual-inertial mapping approach that integrates keypoint-based bundle-adjustment with inertial and short-term visual tracking through non-linear factor recovery.

• As the first layer of our mapping approach we propose a VIO system which outperforms the state-of-the-art methods in terms of trajectory accuracy on the majority of the evaluated sequences. This is achieved by carefully combining appropriate components (patch tracking, landmark representation, first-estimate Jacobians, marginalization scheme) as detailed in Sec. IV.

• Unlike other state-of-the-art systems that use preintegrated IMU measurements also for mapping, we subsume high-frame rate visual-inertial information in non-linear factors extracted from the marginalization prior of the VIO layer. This results not only in a smaller optimization problem but also in better pose estimates in the resulting gravity aligned map.

We encourage the reader to watch the demonstration video and inspect the open-source implementation of the system, which is available at:

https://vision.in.tum.de/research/vslam/basalt

II Related Work

**Visual-inertial odometry:** Early methods for visual-inertial odometry are primarily filter-based [11, 18]. In tightly integrated filters, the prediction step typically propagates the current camera state estimate using the IMU measurements. The state is recursively corrected based on the camera images. A significant drawback of filters is that the linearization point for the non-linear measurement and state transition models cannot be changed, once a measurement is integrated. Fixed-lag smoothers (a.k.a. optimization-based approaches) such as [13, 27] relinearize at the current states in a local optimization window of recent frames. The visual-inertial state estimation is formulated as a full bundle adjustment (BA) over keyframes and IMU measurements. The problem is reduced to a computationally manageable size by marginalization of old frames up to the recent set in the optimization window. The continuous relinearization, windowed optimization and maintenance of the marginalization prior increase the accuracy of the methods. The above methods need to discard keypoints and observations that are observed in marginalized keyframes in order to maintain the sparse structure of the marginalization prior. Hsiung et al. [9] apply non-linear factor recovery to achieve a sparse marginalization prior without discarding information about observed keypoints. This way, the approach can further refine the keypoints and achieve higher accuracy, but in contrast to our work it is limited to local BA.

**Visual-inertial mapping:** Only a few works have tackled globally consistent mapping from visual and inertial measurements. Kasyanov et al. [12] add a pose-graph optimization layer with loop-closing on top of a keyframe-based visual-inertial odometry method [13]. The pose graph is built from the keyframes of the VIO and their relative pose estimates. In [19], the authors add inertial measurements to a keyframe-based SLAM system through IMU preintegration. The IMU measurements are preintegrated into a set of pseudo-measurements between keyframes. They notice that the accuracy of preintegrated measurements degrades over time and restrict the time between keyframes to 0.5 seconds in local BA and 3 seconds in global BA. A further shortcoming of the method is its requirement of estimating the camera velocity and IMU biases at each keyframe, which are less well constrained through visual measurements than in our approach due to the strong temporal subsampling into keyframes. Schneider et al. [24] follow a similar approach in which preintegrated IMU measurements are inserted into the optimization. The approach in [20] proposes a combination of VIO and 4 degree-of-freedom (DoF) pose optimization for visual-inertial mapping. They fix 2 DoF (roll and pitch) and optimize only for the others. We also constrain roll and pitch from visual-inertial measurements. However, we extract non-linear factors in a probabilistic formulation which account for uncertainties in those values and are traded off with other information in the global probabilistic optimization.

III Preliminaries

In this paper, we write matrices as bold capital letters (e.g. $\mathbf{R}$ ) and vectors as bold lowercase letters (e.g. ${\bm{\xi}}$ ). Rigid-body poses are represented as $(\mathbf{R},\mathbf{p})\in\mathrm{SO}(3)\times\mathbb{R}^{3}$ or as transformation matrices $\mathbf{T}\in\mathrm{SE}(3)$ when needed. Incrementing a rotation $\mathbf{R}$ by an increment $\bm{\xi}\in\mathbb{R}^{3}$ is defined as $\mathbf{R}\oplus\bm{\xi}=\mathrm{Exp}(\bm{\xi})\mathbf{R}$ . The difference between two rotations $\mathbf{R}_{1}$ and $\mathbf{R}_{2}$ is calculated as $\mathbf{R}_{1}\ominus\mathbf{R}_{2}=\mathrm{Log}(\mathbf{R}_{1}\mathbf{R}_{2}^{-1})$ such that $(\mathbf{R}\oplus\bm{\xi})\ominus\mathbf{R}=\bm{\xi}$ . Here we use $\mathrm{Exp}\colon\mathbb{R}^{3}\rightarrow\mathrm{SO}(3)$ , which is a composition of the hat operator ( $\mathbb{R}^{3}\rightarrow\mathfrak{so}(3)$ ) and the matrix exponential ( $\mathfrak{so}(3)\rightarrow\mathrm{SO}(3)$ ) and maps rotation vectors to their corresponding rotation matrices, and its inverse $\mathrm{Log}\colon\mathrm{SO}(3)\rightarrow\mathbb{R}^{3}$ . For all other variables, such as translation, velocity and biases, we define $\oplus$ and $\ominus$ as regular addition and subtraction.
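The rotation operators defined above can be checked numerically. The following is a minimal sketch, assuming Rodrigues' formula for $\mathrm{Exp}$ and a $\mathrm{Log}$ that is only valid for rotation angles below $\pi$; the helper names (`so3_exp`, `so3_log`, `oplus`, `ominus`) are illustrative and not taken from the paper's implementation.

```python
import math

def hat(w):
    """Map a rotation vector in R^3 to its skew-symmetric matrix in so(3)."""
    x, y, z = w
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]

def so3_exp(w):
    """Exp: rotation vector -> rotation matrix (Rodrigues' formula)."""
    theta = math.sqrt(sum(c * c for c in w))
    K = hat(w)
    K2 = matmul(K, K)
    if theta < 1e-8:
        a, b = 1.0, 0.5  # leading Taylor terms for very small angles
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j]
             for j in range(3)] for i in range(3)]

def so3_log(R):
    """Log: rotation matrix -> rotation vector (angles below pi only)."""
    c = max(-1.0, min(1.0, (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0))
    theta = math.acos(c)
    scale = 0.5 if theta < 1e-8 else theta / (2.0 * math.sin(theta))
    return [scale * (R[2][1] - R[1][2]),
            scale * (R[0][2] - R[2][0]),
            scale * (R[1][0] - R[0][1])]

def oplus(R, xi):
    """R (+) xi = Exp(xi) R, i.e. a left-multiplicative increment."""
    return matmul(so3_exp(xi), R)

def ominus(R1, R2):
    """R1 (-) R2 = Log(R1 R2^{-1}); the inverse of a rotation is its transpose."""
    return so3_log(matmul(R1, transpose(R2)))

R = so3_exp([0.3, -0.2, 0.5])
xi = [0.01, 0.02, -0.03]
recovered = ominus(oplus(R, xi), R)  # should reproduce xi up to rounding
```

Because the increment is applied on the left, $(\mathbf{R}\oplus\bm{\xi})\ominus\mathbf{R}$ reduces to $\mathrm{Log}(\mathrm{Exp}(\bm{\xi}))$, which is why `recovered` matches `xi`.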

In the following we will use a state $\mathbf{s}$ that is defined as a tuple of several rotation and vector variables, and a function $\mathbf{r}(\mathbf{s})$ that depends on it and can also produce rotations and vectors as the result. An increment $\bm{\xi}\in\mathbb{R}^{n}$ is a stacked vector with all the increments of the variables in $\mathbf{s}$ . Then, the Jacobian of the function with respect to the increment is defined as

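$$\mathbf{J}_{\mathbf{r}(\mathbf{s})}=\lim_{\bm{\xi}\rightarrow\mathbf{0}}\frac{\mathbf{r}(\mathbf{s}\oplus\bm{\xi})\ominus\mathbf{r}(\mathbf{s})}{\bm{\xi}}.\tag{1}$$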

Here, $\mathbf{s}\oplus\bm{\xi}$ denotes that each component in $\mathbf{s}$ is incremented with the corresponding segment in $\bm{\xi}$ using the appropriate definition of the $\oplus$ operator, and similarly for $\ominus$ . The limit is taken component-wise, such that the Jacobian is a matrix. For Euclidean quantities, this definition is just a normal derivative, with an extension for rotations, both as function value and as function argument. For more details and possible alternative formulations we refer the reader to [2, 4, 7].

In non-linear least squares problems, we minimize functions of the form

$$E(\mathbf{s})=\frac{1}{2}\mathbf{r}(\mathbf{s})^{\top}\mathbf{W}\mathbf{r}(\mathbf{s}),\tag{2}$$

which is a weighted sum of squared residuals with block-diagonal weight matrix $\mathbf{W}$ . In this case, $\mathbf{r}(\mathbf{s})$ is purely vector-valued. Near the current state $\mathbf{s}$ we can use a linear approximation of the residual, which leads to

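$$E(\mathbf{s}\oplus\bm{\xi})=E(\mathbf{s})+\bm{\xi}^{\top}\mathbf{J}_{\mathbf{r}(\mathbf{s})}^{\top}\mathbf{W}\mathbf{r}(\mathbf{s})+\frac{1}{2}\bm{\xi}^{\top}\mathbf{J}_{\mathbf{r}(\mathbf{s})}^{\top}\mathbf{W}\mathbf{J}_{\mathbf{r}(\mathbf{s})}\bm{\xi}.\tag{3}$$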

The optimum of this approximated energy can be attained using the Gauss-Newton increment

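$$\bm{\xi}^{*}=-(\mathbf{J}_{\mathbf{r}(\mathbf{s})}^{\top}\mathbf{W}\mathbf{J}_{\mathbf{r}(\mathbf{s})})^{-1}\mathbf{J}_{\mathbf{r}(\mathbf{s})}^{\top}\mathbf{W}\mathbf{r}(\mathbf{s}).\tag{4}$$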

With this, we can iteratively update the state $\mathbf{s}_{i+1}=\mathbf{s}_{i}\oplus\bm{\xi}^{*}$ until convergence.
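To make the iteration concrete, here is a minimal Gauss-Newton sketch for a purely Euclidean two-dimensional state, so that $\oplus$ is plain addition. The toy problem (locating a point from ranges to three known anchors), the identity weight matrix, and all names are assumptions chosen for illustration; they do not come from the paper.

```python
import math

def gauss_newton(residuals_and_jacobian, s, iterations=10):
    """Minimize 0.5 * r(s)^T r(s) (W = identity) over a 2D Euclidean state."""
    for _ in range(iterations):
        r, J = residuals_and_jacobian(s)
        # Normal equations: H = J^T J and b = J^T r.
        h00 = sum(j[0] * j[0] for j in J)
        h01 = sum(j[0] * j[1] for j in J)
        h11 = sum(j[1] * j[1] for j in J)
        b0 = sum(j[0] * ri for j, ri in zip(J, r))
        b1 = sum(j[1] * ri for j, ri in zip(J, r))
        det = h00 * h11 - h01 * h01
        # xi* = -H^{-1} b, using the closed-form inverse of a 2x2 matrix.
        xi = (-(h11 * b0 - h01 * b1) / det,
              -(-h01 * b0 + h00 * b1) / det)
        s = (s[0] + xi[0], s[1] + xi[1])  # s_{i+1} = s_i (+) xi*
    return s

# Hypothetical toy problem: noise-free ranges from three anchors.
ANCHORS = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
TRUE_POINT = (1.0, 2.0)
RANGES = [math.hypot(TRUE_POINT[0] - ax, TRUE_POINT[1] - ay)
          for ax, ay in ANCHORS]

def range_residuals(s):
    """Residual r_k = ||s - a_k|| - d_k and its Jacobian row (s - a_k)/||s - a_k||."""
    r, J = [], []
    for (ax, ay), d in zip(ANCHORS, RANGES):
        dist = math.hypot(s[0] - ax, s[1] - ay)
        r.append(dist - d)
        J.append(((s[0] - ax) / dist, (s[1] - ay) / dist))
    return r, J

estimate = gauss_newton(range_residuals, (2.0, 2.0))
```

Starting from `(2.0, 2.0)`, the first increment already moves the estimate to roughly `(1.10, 2.07)`, and later iterations refine it towards `TRUE_POINT`.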

IV Visual-Inertial Odometry

We formulate the incremental motion tracking of the camera-IMU setup over time as fixed-lag smoothing. First, we use patch-based optical flow to track a sparse set of points in the 2D image plane between consecutive frames. This information is then used in a bundle-adjustment framework which for every frame minimizes an error that consists of point reprojection and IMU propagation terms. To maintain a fixed parameter size of the optimization problem we marginalize out old states. In the remainder of this section we will discuss these stages in more detail.

IV-A KLT Tracking

As a first step of our algorithm we detect a sparse set of keypoints in the frame using the FAST [22] corner detector. To track the motion of these points over a series of consecutive frames we use sparse optical flow based on KLT [14]. To achieve fast, accurate and robust tracking we combine the inverse-compositional approach as described in [1] with a patch dissimilarity norm that is invariant to intensity scaling. Several authors suggested zero-normalized cross-correlation (ZNCC) for illumination-invariant optical flow [17, 25], but we use locally-scaled sum of squared differences (LSSD) defined in [21] which is computationally less expensive than alternatives.

We formulate the patch tracking problem as estimating the transform $\mathbf{T}\in\mathrm{SE}(2)$ between two corresponding patches in two consecutive frames that minimizes the differences between the patches according to the selected norm. Essentially, we minimize a sum of squared residuals, where every residual is defined as

[TABLE]

Here, $I_{t}(\mathbf{x})$ is the intensity of image $t$ at pixel location $\mathbf{x}$ . The set of image coordinates that defines the patch is denoted $\Omega$ and the mean intensity of the patch in image $t$ is $\overline{I_{t}}$ . A visualization of the patch and tracking results is shown in Fig. 2.

To achieve robustness to large displacements in the image we use a pyramidal approach, where the patch is first tracked on the coarsest level and then on increasingly finer levels. For outlier filtering, instead of an absolute threshold on the error, we track the patches from the current frame to the target frame and back to check consistency. Points that do not return to the initial location with the second tracking are considered as outliers and discarded.
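The two ideas in this subsection, a dissimilarity norm that is invariant to intensity scaling and a forward-backward consistency check, can be sketched as follows. This is an illustrative sketch only: it assumes that each patch is normalized by its own mean intensity before differencing (consistent with the quantities named above), that the patches have already been warped into correspondence, and that the pixel threshold `max_error_px` is a hypothetical parameter.

```python
def lssd_residuals(patch_ref, patch_tgt):
    """Per-pixel residuals between mean-normalized intensity patches."""
    mean_ref = sum(patch_ref) / len(patch_ref)
    mean_tgt = sum(patch_tgt) / len(patch_tgt)
    return [t / mean_tgt - r / mean_ref for r, t in zip(patch_ref, patch_tgt)]

def lssd_cost(patch_ref, patch_tgt):
    """Sum of squared residuals; zero if the patches differ only by a gain."""
    return sum(v * v for v in lssd_residuals(patch_ref, patch_tgt))

def passes_forward_backward_check(start, returned, max_error_px):
    """Keep a point only if tracking forward and back returns near its start."""
    dx = returned[0] - start[0]
    dy = returned[1] - start[1]
    return dx * dx + dy * dy <= max_error_px ** 2

reference = [10.0, 20.0, 30.0, 40.0]
brighter = [2.5 * v for v in reference]  # same patch under a global gain
different = [40.0, 10.0, 30.0, 20.0]     # different image content

cost_gain = lssd_cost(reference, brighter)        # close to zero
cost_different = lssd_cost(reference, different)  # clearly positive
```

Dividing each patch by its mean cancels a global gain, which is the invariance the text asks for, without the per-patch variance computation that ZNCC would require.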

IV-B Visual-Inertial Bundle Adjustment

To estimate the motion of the camera we combine error terms based on tracked feature locations from KLT tracking with IMU error terms based on preintegrated IMU measurements [8].

We use the following coordinate frames throughout the paper: W is the world frame, I is the IMU frame and $\text{C}_{i}$ is the frame of camera $i$ , where $i$ is the index of the camera in a stereo setup. We estimate transformations $\mathbf{T}_{\text{WI}}\in\mathrm{SE}(3)$ from IMU to world coordinate frame. The transformations $\mathbf{T}_{\text{IC}_{i}}$ from camera frame $i$ to IMU frame and the projection functions $\pi_{i}$ are assumed to be static and known from calibration. For the formulation of reprojection errors we denote the transformations from camera $i$ to world by $\mathbf{T}_{\text{WC}_{i}}$ . Those do not constitute additional optimization variables and are calculated using $\mathbf{T}_{\text{WI}}$ and $\mathbf{T}_{\text{IC}_{i}}$ in practice.
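As a small illustration of how $\mathbf{T}_{\text{WC}_{i}}$ follows from $\mathbf{T}_{\text{WI}}$ and $\mathbf{T}_{\text{IC}_{i}}$, the sketch below composes two poses stored as $(\mathbf{R},\mathbf{p})$ pairs. The numeric values and function names are made up for the example.

```python
def compose(T_ab, T_bc):
    """Return T_ac = T_ab * T_bc for poses stored as (R, p) pairs."""
    R_ab, p_ab = T_ab
    R_bc, p_bc = T_bc
    R = [[sum(R_ab[i][k] * R_bc[k][j] for k in range(3)) for j in range(3)]
         for i in range(3)]
    p = [sum(R_ab[i][k] * p_bc[k] for k in range(3)) + p_ab[i]
         for i in range(3)]
    return R, p

def transform(T, x):
    """Map a 3D point x from the source frame of T to its target frame."""
    R, p = T
    return [sum(R[i][k] * x[k] for k in range(3)) + p[i] for i in range(3)]

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ROT_Z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical values: IMU pose in the world and camera-to-IMU extrinsics.
T_WI = (IDENTITY, [1.0, 0.0, 0.0])
T_IC = (ROT_Z_90, [0.0, 0.1, 0.0])

T_WC = compose(T_WI, T_IC)  # derived quantity, not an optimization variable
point_world = transform(T_WC, [1.0, 0.0, 0.0])
```

Only `T_WI` would be estimated; `T_IC` stays fixed from calibration, so `T_WC` is recomputed whenever the IMU pose changes.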

At different points in time, we optimize a state

$$\mathbf{s}=\{\mathbf{s}_{\text{k}},\mathbf{s}_{\text{f}},\mathbf{s}_{\text{l}}\},\tag{6}$$

where $\mathbf{s}_{\text{k}}$ contains IMU poses for $n$ older keyframes, $\mathbf{s}_{\text{f}}$ contains IMU poses, velocities and biases of the $m$ most recent frames, which possibly are also keyframes if they host landmarks, and $\mathbf{s}_{\text{l}}$ contains landmarks. A graphical representation of the problem is shown in Fig. 5 (a). Landmarks are stored relative to the keyframe where they were observed for the first time [16] and defined by a unit-length direction vector in the coordinate frame of the camera and an inverse distance to the landmark [6]. In the proposed system only keyframes host landmarks, which distinguishes them from regular frames.

IV-B1 Representation of Unit Vectors in 3D

In order to avoid the necessity of additional constraints for the optimization and to keep the number of optimization variables small, we parametrize the bearing vector in 3D space using a minimal representation, which is two-dimensional. In [3] the authors provide an extensive review of possible parametrizations and suggest a new parametrization based on $\mathrm{SO}(3)$ rotations that yields simple derivatives with respect to 2D increments.

In this work we use a parametrization based on stereographic projection that given 2D coordinates $(u,v)^{\top}$ generates a unit-length bearing vector

[TABLE]

This parametrization is efficient as it only uses simple operations such as multiplication and division (compared to trigonometric operations needed in [6]) and is defined for all $u$ and $v$ . A geometric interpretation is shown in Fig. 3. The only direction vector that cannot be represented with finite $u,v$ is the negative $Z$ -direction $\begin{pmatrix}0&0&-1\end{pmatrix}^{\top}$ . However, this is not a drawback in practice, as cameras usually have a limited field of view and cannot see points behind them.
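The properties listed above can be reproduced with a standard stereographic projection. The sketch below assumes the common form with $\eta = 2/(1+u^{2}+v^{2})$ and bearing $(\eta u, \eta v, \eta-1)^{\top}$; since the formula itself is not reproduced in this text, treat the exact scaling as an assumption. The function names are illustrative.

```python
import math

def bearing_from_uv(u, v):
    """Map 2D coordinates to a unit-length bearing vector.

    Uses only multiplication and division and is defined for all (u, v).
    """
    eta = 2.0 / (1.0 + u * u + v * v)
    return (eta * u, eta * v, eta - 1.0)

def uv_from_bearing(x, y, z):
    """Inverse map; undefined only for the direction (0, 0, -1)."""
    return (x / (1.0 + z), y / (1.0 + z))

b = bearing_from_uv(0.3, -0.7)
norm = math.sqrt(sum(c * c for c in b))  # 1 up to rounding
forward_axis = bearing_from_uv(0.0, 0.0)  # the positive Z direction
```

As $u$ and $v$ grow, $\eta$ tends to zero and the bearing approaches $(0,0,-1)^{\top}$ without ever reaching it for finite coordinates, which matches the single excluded direction described in the text.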

IV-B2 Reprojection Error

The first cue we can use for motion estimation is the reprojection error. When point $i$ that is hosted in frame $h(i)$ is detected in target frame $t$ at image coordinates $\mathbf{z}_{it}$ , the residual is defined as

[TABLE]

where $c(t)$ is the index of the camera used to take frame $t$ . The pose $\mathbf{T}_{t}$ denotes $\mathbf{T}_{\text{WC}_{c(t)}}$ at the time when frame $t$ has been taken, and similarly for $\mathbf{T}_{h(i)}$ . The first three entries of the homogeneous point coordinates $\mathbf{q}_{i}(u,v,d)$ are computed from the minimal representation $(u,v)$ as described in Sec. IV-B1, with an additional fourth entry $d$ , the inverse distance. Since the projection function is independent of scale we do not have to normalize $\mathbf{q}_{i}$ , which makes this formulation numerically stable even when $d$ is close or equal to zero.
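The remark about scale independence can be illustrated with a pinhole camera. The sketch assumes a pinhole projection with hypothetical intrinsics `fx, fy, cx, cy` and takes the relative pose from host to target frame as a single $(\mathbf{R},\mathbf{p})$ pair; the paper's actual projection functions $\pi_{i}$ may differ.

```python
def reprojection_residual(z_obs, T_target_host, q, fx, fy, cx, cy):
    """Residual z - pi(T q) for a homogeneous point q = (x, y, z, d).

    (x, y, z) is the bearing in the host frame and d the inverse distance.
    """
    R, p = T_target_host
    bearing, d = q[:3], q[3]
    # Homogeneous transform: the translation is scaled by the fourth entry,
    # so no division by d is needed and d = 0 is handled gracefully.
    X = [sum(R[i][k] * bearing[k] for k in range(3)) + p[i] * d
         for i in range(3)]
    u = fx * X[0] / X[2] + cx
    v = fy * X[1] / X[2] + cy
    return (z_obs[0] - u, z_obs[1] - v)

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
T = (IDENTITY, [0.1, 0.0, 0.0])          # hypothetical 10 cm sideways offset
K = dict(fx=400.0, fy=400.0, cx=320.0, cy=240.0)

q_near = (0.0, 0.0, 1.0, 0.5)            # on the optical axis, 2 m away
q_scaled = (0.0, 0.0, 2.0, 1.0)          # the same point, scaled by 2
q_infinite = (0.0, 0.0, 1.0, 0.0)        # point at infinity, d = 0

r_near = reprojection_residual((340.0, 240.0), T, q_near, **K)
r_scaled = reprojection_residual((340.0, 240.0), T, q_scaled, **K)
r_infinite = reprojection_residual((320.0, 240.0), T, q_infinite, **K)
```

All three residuals are zero up to rounding: scaling $\mathbf{q}_{i}$ leaves the projected pixel unchanged, and a point at infinity simply becomes insensitive to translation.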

IV-B3 IMU Error

The second cue for motion estimation is the IMU data. To deal with the high frequency of IMU measurements we preintegrate several consecutive IMU measurements into a pseudo-measurement. When adding an IMU factor between frame $i$ and frame $j$ , we compute pseudo-measurement $\Delta\mathbf{s}=(\Delta\mathbf{R},\Delta\mathbf{v},\Delta\mathbf{p})$ similar to [8]. For this, we compute bias-corrected accelerations $\mathbf{a}_{t}=\mathbf{a}_{t}^{\text{raw}}-\bar{\mathbf{b}}_{i}^{\text{a}}$ and rotational velocities $\bm{\omega}_{t}=\bm{\omega}_{t}^{\text{raw}}-\bar{\mathbf{b}}_{i}^{\text{g}}$ using the raw accelerometer $\mathbf{a}_{t}^{\text{raw}}$ and gyroscope $\bm{\omega}_{t}^{\text{raw}}$ measurements. We fix the corresponding biases $\bar{\mathbf{b}}_{i}^{\text{a}}$ and $\bar{\mathbf{b}}_{i}^{\text{g}}$ for the entire preintegration time and use linear approximation to account for changes in these variables.

For the timestamp $t_{i}$ of frame $i$ , we assign the initial state delta $\Delta\mathbf{s}_{t_{i}}=(\mathbf{I},\mathbf{0},\mathbf{0})$ . Then, for each IMU timestamp $t$ satisfying $t_{i}<t\leq t_{j}$ the following updates are calculated.

[TABLE]

This defines $\Delta\mathbf{s}_{t+1}$ as a function of $\Delta\mathbf{s}_{t}$ , $\mathbf{a}_{t+1}$ , and $\bm{\omega}_{t+1}$ ,

$$\Delta\mathbf{s}_{t+1}=f(\Delta\mathbf{s}_{t},\mathbf{a}_{t+1},\bm{\omega}_{t+1}),$$

with corresponding Jacobian $\mathbf{J}_{f}=[\mathbf{J}_{f}^{\text{s}},\mathbf{J}_{f}^{\text{a}},\mathbf{J}_{f}^{\text{g}}]$ . Furthermore, all previous iterations of $f$ up to $t+1$ define $\Delta\mathbf{s}_{t+1}$ as a function of the biases,

$$\Delta\mathbf{s}_{t+1}=g_{t+1}(\mathbf{b}_{i}^{\text{a}},\mathbf{b}_{i}^{\text{g}}).$$

Starting with zero-initialization, the corresponding Jacobian $\mathbf{J}_{g_{t+1}}=[\mathbf{J}_{g_{t+1}}^{\text{a}},\mathbf{J}_{g_{t+1}}^{\text{g}}]$ can be computed recursively using $\mathbf{J}_{f}$ ,

[TABLE]

which results from the chain rule. Eventually, the Jacobians of $g_{t_{j}}$ are denoted $\mathbf{J}^{\text{g}}$ and $\mathbf{J}^{\text{a}}$ . Small changes in biases can be represented as increments to the linearization point $\mathbf{b}^{\text{a}}_{i}=\bar{\mathbf{b}}^{\text{a}}_{i}+\bm{\epsilon}^{\text{a}}$ and $\mathbf{b}^{\text{g}}_{i}=\bar{\mathbf{b}}^{\text{g}}_{i}+\bm{\epsilon}^{\text{g}}$ . Then, $\Delta\mathbf{s}$ is approximated as

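$$\Delta\tilde{\mathbf{s}}(\mathbf{b}_{i}^{\text{a}},\mathbf{b}_{i}^{\text{g}})=\Delta\mathbf{s}(\bar{\mathbf{b}}_{i}^{\text{a}},\bar{\mathbf{b}}_{i}^{\text{g}})\oplus(\mathbf{J}^{\text{a}}\bm{\epsilon}^{\text{a}}+\mathbf{J}^{\text{g}}\bm{\epsilon}^{\text{g}}),$$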

with components $\Delta\tilde{\mathbf{s}}=(\Delta\tilde{\mathbf{R}},\Delta\tilde{\mathbf{v}},\Delta\tilde{\mathbf{p}})$ . The residuals are then calculated as

[TABLE]

where $\mathbf{g}$ is the gravity vector and $\mathbf{R}$ and $\mathbf{p}$ denote the rotation and translation components of $\mathbf{T}_{\text{WI}}$ , respectively. These residuals have to be weighted with an appropriate covariance matrix, which can also be calculated recursively. Starting from $\bm{\Sigma}_{t_{i}}=\mathbf{0}$ , updates are calculated as

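$$\bm{\Sigma}_{t+1}=\mathbf{J}_{f}^{\text{s}}\bm{\Sigma}_{t}{\mathbf{J}_{f}^{\text{s}}}^{\top}+\mathbf{J}_{f}^{\text{a}}\bm{\Sigma}^{\text{a}}{\mathbf{J}_{f}^{\text{a}}}^{\top}+\mathbf{J}_{f}^{\text{g}}\bm{\Sigma}^{\text{g}}{\mathbf{J}_{f}^{\text{g}}}^{\top},$$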

with diagonal matrices $\bm{\Sigma}^{\text{a}}$ and $\bm{\Sigma}^{\text{g}}$ that contain the hardware-specific IMU noise parameters for accelerometer and gyroscope. For more detailed information about the underlying physical model of the IMU and preintegration theory we refer the reader to [8].
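The recursive structure of preintegration and covariance propagation is easiest to see in one dimension. The sketch below is a deliberately simplified assumption: it drops rotation and the gyroscope, uses a plain Euler update for the pseudo-measurement $(\Delta v, \Delta p)$, and treats `sigma_a` as a hypothetical per-sample accelerometer noise. It is not the three-dimensional update used in the paper.

```python
def preintegrate_1d(accels, dt, sigma_a):
    """Preintegrate bias-corrected 1D accelerations into (dv, dp).

    Also propagates the 2x2 covariance of (dv, dp), stored as the three
    distinct entries (var_v, cov_vp, var_p), starting from zero.
    """
    dv, dp = 0.0, 0.0
    s_vv, s_vp, s_pp = 0.0, 0.0, 0.0
    var_a = sigma_a ** 2
    for a in accels:
        # Sigma <- Js Sigma Js^T + Ja var_a Ja^T with
        # Js = [[1, 0], [dt, 1]] and Ja = [dt, 0]^T.
        new_vv = s_vv + dt * dt * var_a
        new_vp = s_vp + dt * s_vv
        new_pp = s_pp + 2.0 * dt * s_vp + dt * dt * s_vv
        s_vv, s_vp, s_pp = new_vv, new_vp, new_pp
        # State update: position uses the previous velocity delta.
        dp += dv * dt
        dv += a * dt
    return (dv, dp), (s_vv, s_vp, s_pp)

(dv, dp), cov_short = preintegrate_1d([1.0] * 10, dt=0.1, sigma_a=0.05)
_, cov_long = preintegrate_1d([1.0] * 40, dt=0.1, sigma_a=0.05)
```

In this toy model the position variance grows roughly with the cube of the integration time, so quadrupling the interval inflates it by far more than a factor of four. This is the effect that makes preintegrated measurements between temporally distant keyframes weakly informative.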

IV-B4 Optimization and Partial Marginalization

For each new frame we minimize a non-linear energy that consists of reprojection terms, IMU terms and a marginalization prior $E_{\text{m}}$

[TABLE]

The reprojection errors are summed over the set of points $\mathcal{P}$ and for each point $i$ over the set $\mathrm{obs}(i)$ of frames where the point is observed, including its host frame. The set $\mathcal{C}$ contains pairs of frames which are connected by IMU factors.

The energy $E$ is optimized using the Gauss-Newton algorithm. To constrain the problem size we fix the number of keyframe poses and consecutive states that we optimize at every iteration. When a new frame is added, there are $n$ pose-only keyframes in $\mathbf{s}_{\text{k}}$ and the $m$ newest frames including the newly added one in $\mathbf{s}_{\text{f}}$ . After optimizing, we perform a partial marginalization of the state to prevent the problem size from growing.

Two possible scenarios for marginalization are shown in Fig. 5. In the first one we marginalize out the oldest non-keyframe. In this case we drop the landmark factors that have this frame as a target to maintain the sparsity of the problem. In the second case we have a new keyframe, so we marginalize out velocity and biases for this frame and one old keyframe with corresponding landmarks.

In both cases the marginalization is done on the linearized Markov blanket of the variables we want to remove, where the Markov blanket is a collection of incident states to those variables. The linearization $\mathbf{H}$ and $\mathbf{b}$ represent a distribution of the estimated state in the vector space of the increment $\bm{\xi}$ . If we split the increment $\bm{\xi}=[\bm{\xi}_{\alpha}^{\top},\bm{\xi}_{\beta}^{\top}]^{\top}$ into variables $\bm{\xi}_{\alpha}$ to stay in the system and variables $\bm{\xi}_{\beta}$ to be marginalized, we can compute the parameters of the new distribution using the Schur complement,

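$$\mathbf{H}^{\text{m}}_{\alpha\alpha}=\mathbf{H}_{\alpha\alpha}-\mathbf{H}_{\alpha\beta}\mathbf{H}_{\beta\beta}^{-1}\mathbf{H}_{\beta\alpha},\qquad\mathbf{b}^{\text{m}}_{\alpha}=\mathbf{b}_{\alpha}-\mathbf{H}_{\alpha\beta}\mathbf{H}_{\beta\beta}^{-1}\mathbf{b}_{\beta},$$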

where we have split the original $\mathbf{H}$ and $\mathbf{b}$ into

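$$\mathbf{H}=\begin{bmatrix}\mathbf{H}_{\alpha\alpha}&\mathbf{H}_{\alpha\beta}\\ \mathbf{H}_{\beta\alpha}&\mathbf{H}_{\beta\beta}\end{bmatrix},\quad\mathbf{b}=\begin{bmatrix}\mathbf{b}_{\alpha}\\ \mathbf{b}_{\beta}\end{bmatrix}.$$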

$\mathbf{H}^{\text{m}}_{\alpha\alpha}$ and $\mathbf{b}^{\text{m}}_{\alpha}$ now define an energy term that only depends on $\bm{\xi}_{\alpha}$ and can be added to the total energy at the next iteration.
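A small numeric example may help to see why the marginalized terms preserve the information about the remaining variables. The sketch uses the standard Schur complement with scalar blocks, i.e. one kept variable and one marginalized variable; the numbers are arbitrary and the function name is illustrative.

```python
def marginalize(H, b):
    """Schur complement for H = [[H_aa, H_ab], [H_ba, H_bb]], b = [b_a, b_b].

    All blocks are scalars here; the beta variable is marginalized out.
    """
    (h_aa, h_ab), (h_ba, h_bb) = H
    b_a, b_b = b
    h_m = h_aa - h_ab * h_ba / h_bb
    b_m = b_a - h_ab * b_b / h_bb
    return h_m, b_m

H = [[4.0, 2.0], [2.0, 3.0]]
b = [1.0, 2.0]
h_m, b_m = marginalize(H, b)

# Solving the reduced system gives the same alpha increment as solving the
# full system xi = -H^{-1} b and reading off its first component.
xi_alpha_reduced = -b_m / h_m
det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
xi_alpha_full = -(H[1][1] * b[0] - H[0][1] * b[1]) / det
```

Here `h_m` equals 8/3, which is the inverse of the top-left entry of $\mathbf{H}^{-1}$ (3/8). In other words, the marginalized information matrix describes exactly the marginal covariance of the kept variable.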

We use first-estimate Jacobians [10] to maintain the nullspace properties of the linearized marginalization prior. As soon as a variable becomes a part of the marginalization prior, its linearization point is fixed, and the Jacobian used to calculate $\mathbf{H}$ and $\mathbf{b}$ is evaluated at this linearization point, while the residuals are calculated at the current state estimate. Residuals already in the marginalization term have to be linearly approximated, thus not $\mathbf{b}_{\alpha}^{\text{m}}$ , but $\mathbf{b}_{\alpha}^{\text{m}}+\mathbf{H}^{\text{m}}_{\alpha\alpha}\bm{\delta}_{\alpha}$ is added to the Gauss-Newton optimization once $\bm{\xi}_{\alpha}$ deviates by $\bm{\delta}_{\alpha}$ from the state used to calculate the residuals in $\mathbf{b}_{\alpha}^{\text{m}}$ .

V Visual-Inertial Mapping

The fixed-lag smoothing method for visual-inertial odometry (Fig. 4) presented in the previous section accumulates drift in the estimate due to the fixed linearization points outside the optimization window. A typical approach to eliminate such drift is to detect loop closures and incorporate loop-closing constraints into the optimization. We propose a two-layered approach which runs our visual-inertial odometry on the lower layer and bundle-adjustment on the visual-inertial mapping layer, where we additionally use non-linear factors that summarize the keyframe pose information from the odometry layer. BA optimizes the camera poses of keyframes and positions of keypoints. We implicitly detect loop closures using keypoint matching and achieve globally consistent mapping.

V-A Global Map Optimization

To get statistically independent observations we detect and match ORB [23] features (distinct from VIO points) between the keyframes in the global map optimization. This allows us to use the reprojection error function as defined in Eq. (8). Combining this reprojection error with the error terms from the recovered non-linear factors yields the objective function:

[TABLE]

where $E_{\text{nfr}}(\mathbf{s})$ collects the error terms by the recovered non-linear factors. These factors and their recovery are detailed in the following. The state $\mathbf{s}$ that we optimize on this global optimization layer includes the keyframe poses and the positions of the new landmarks (parametrized as in Sec. IV-B1).

We interface the global map optimization with the VIO layer at the keyframe poses. When a keyframe is marginalized out from the VIO, we save the linearization of the Markov blanket (Fig. 5 (c)) and marginalize all other variables except the keyframe poses. From this marginalization prior, we recover a set of non-linear factors on the keyframe poses that approximate the distribution stored in it.
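Marginalizing all variables except the keyframe poses from a saved linearization is commonly done with the Schur complement. The following is a minimal sketch under that assumption, using a dense information matrix and a hypothetical helper `marginalize`; index sets and sizes are made up for illustration.

```python
import numpy as np

def marginalize(H, b, keep):
    """Schur complement of the linear system (H, b) onto the indices in `keep`."""
    keep = np.asarray(keep)
    drop = np.setdiff1d(np.arange(H.shape[0]), keep)
    H_kk = H[np.ix_(keep, keep)]
    H_kd = H[np.ix_(keep, drop)]
    H_dd = H[np.ix_(drop, drop)]
    # Solve H_dd X = [H_dk, b_d] instead of forming an explicit inverse.
    X = np.linalg.solve(H_dd, np.column_stack([H_kd.T, b[drop]]))
    H_marg = H_kk - H_kd @ X[:, :-1]
    b_marg = b[keep] - H_kd @ X[:, -1]
    return H_marg, b_marg

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
H = A.T @ A + np.eye(6)            # toy positive-definite information matrix
b = rng.standard_normal(6)
keep = [0, 1, 2]                   # stand-in for the keyframe pose entries
H_marg, b_marg = marginalize(H, b, keep)
```

The inverse of `H_marg` equals the corresponding block of the full covariance, so no information about the kept variables is lost in the linearized system.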

V-B Non-Linear Factor Recovery

Non-linear factor recovery (NFR [15]) approximates a dense distribution stored in the linearized Markov blanket of the original factor graph with a different set of non-linear factors that yield a sparse factor graph topology. While the initial aim of NFR is to keep the computational complexity of SLAM optimization bounded, we use it to transfer information accumulated during VIO to our globally consistent visual-inertial map optimization.

By linearization of the residual function of a non-linear least squares problem Eq. (2), we obtain a multivariate Gaussian distribution $p(\mathbf{s})\sim N(\bm{\mu}_{\text{o}},\mathbf{H}_{\text{o}}^{-1})$ in which the mean $\bm{\mu}_{\text{o}}$ equals the state estimate. We want to construct another distribution $p_{\text{a}}(\mathbf{s})\sim N(\bm{\mu}_{\text{a}},\mathbf{H}_{\text{a}}^{-1})$ that well approximates the original distribution with a sparser factor graph topology.

We follow NFR [15] and minimize the Kullback-Leibler divergence (KLD) between the recovered distribution and the original distribution. More formally, we minimize

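$$D_{\text{KL}}\big(p(\mathbf{s})\,\|\,p_{\text{a}}(\mathbf{s})\big)=\tfrac{1}{2}\Big(\operatorname{tr}(\mathbf{H}_{\text{a}}\mathbf{\Sigma}_{\text{o}})-\log\det(\mathbf{H}_{\text{a}}\mathbf{\Sigma}_{\text{o}})+(\bm{\mu}_{\text{a}}-\bm{\mu}_{\text{o}})^{\top}\mathbf{H}_{\text{a}}(\bm{\mu}_{\text{a}}-\bm{\mu}_{\text{o}})-d\Big),\qquad(27)$$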

where $\mathbf{\Sigma}_{\text{o}}=\mathbf{H}_{\text{o}}^{-1}$ and $d$ is constant.

For the $i$-th non-linear factor that we want to recover, we need to define a residual function such that $\mathbf{r}_{i}(\mathbf{s},\mathbf{z}_{i})=\bm{\epsilon}$ with $\bm{\epsilon}\sim N(\mathbf{0},\mathbf{H}_{i}^{-1})$ . NFR estimates the pseudo measurements $\mathbf{z}_{i}$ and information matrices $\mathbf{H}_{i}$ for the factors. Choosing $\mathbf{z}_{i}$ such that $\mathbf{r}_{i}(\bm{\mu}_{\text{o}},\mathbf{z}_{i})=\mathbf{0}$ induces $\bm{\mu}_{\text{a}}=\bm{\mu}_{\text{o}}$ , which makes the third term of Eq. (27) vanish. To estimate $\mathbf{H}_{i}$ we define

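$$\mathbf{J}_{\text{r}}=\begin{pmatrix}\vdots\\ \mathbf{J}_{i}\\ \vdots\end{pmatrix},\qquad\mathbf{H}_{\text{r}}=\operatorname{diag}(\ldots,\mathbf{H}_{i},\ldots),\qquad(28)$$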

where $\mathbf{J}_{\text{r}}$ stacks the Jacobians of the defined residual functions with respect to the state, and $\mathbf{H}_{\text{r}}$ is a block diagonal matrix that consists of the $\mathbf{H}_{i}$ for the corresponding residual functions. This allows us to write $\mathbf{H}_{\text{a}}=\mathbf{J}^{\top}_{\text{r}}\mathbf{H}_{\text{r}}\mathbf{J}_{\text{r}}$ , and consequently, we can recover the information matrices $\mathbf{H}_{i}$ by minimizing

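$$\operatorname{tr}\big(\mathbf{J}^{\top}_{\text{r}}\mathbf{H}_{\text{r}}\mathbf{J}_{\text{r}}\mathbf{\Sigma}_{\text{o}}\big)-\log\det\big(\mathbf{J}^{\top}_{\text{r}}\mathbf{H}_{\text{r}}\mathbf{J}_{\text{r}}\big).\qquad(29)$$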

For full-rank and invertible $\mathbf{J}_{\text{r}}$ , [15, 9] showed that the following closed-form solution exists,

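$$\mathbf{H}_{i}=\big(\{\mathbf{J}_{\text{r}}\mathbf{\Sigma}_{\text{o}}\mathbf{J}^{\top}_{\text{r}}\}_{i}\big)^{-1},\qquad(30)$$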

where $\{\}_{i}$ denotes the corresponding diagonal block.
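The closed-form recovery can be checked numerically. The sketch below assumes equal means (so the mean term of the KLD is zero), a square invertible stacked Jacobian, and two residual blocks of sizes 2 and 3. All matrices are random toy data and the helper names are hypothetical.

```python
import numpy as np

def kld(Sigma_o, H_a):
    """KLD from N(mu, Sigma_o) to N(mu, H_a^-1) for equal means."""
    d = Sigma_o.shape[0]
    _, logdet = np.linalg.slogdet(H_a @ Sigma_o)
    return 0.5 * (np.trace(H_a @ Sigma_o) - logdet - d)

def block_diag(blocks):
    n = sum(B.shape[0] for B in blocks)
    out = np.zeros((n, n))
    start = 0
    for B in blocks:
        k = B.shape[0]
        out[start:start + k, start:start + k] = B
        start += k
    return out

def recover_information(J_r, Sigma_o, block_sizes):
    """H_i = inverse of the i-th diagonal block of J_r Sigma_o J_r^T."""
    S = J_r @ Sigma_o @ J_r.T
    blocks, start = [], 0
    for n in block_sizes:
        blocks.append(np.linalg.inv(S[start:start + n, start:start + n]))
        start += n
    return blocks

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
Sigma_o = A @ A.T + 5.0 * np.eye(5)                 # toy original covariance
J_r = rng.standard_normal((5, 5)) + 3.0 * np.eye(5)  # assumed invertible
sizes = [2, 3]

H_blocks = recover_information(J_r, Sigma_o, sizes)
kld_opt = kld(Sigma_o, J_r.T @ block_diag(H_blocks) @ J_r)

# Scaling one recovered block away from the closed form increases the KLD.
H_pert = [1.3 * H_blocks[0], H_blocks[1]]
kld_pert = kld(Sigma_o, J_r.T @ block_diag(H_pert) @ J_r)
```

The remaining divergence `kld_opt` is generally non-zero: it measures the information lost by dropping the correlations between the recovered factors.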

V-C Non-Linear Factors for Distribution Approximation

When we need to marginalize out a keyframe as shown in Fig. 5 (c), we save the current linearization and marginalize out everything except the keyframe poses. This gives us a factor that densely connects all keyframe poses in the optimization window. We use it to recover non-linear factors between the marginalized keyframe and all other keyframes as shown in Fig. 6. We define the following residual functions:

[TABLE]

where with $\lfloor\rfloor_{xy}$ we denote $x$ and $y$ components of the vector and with $\mathbf{z}$ we denote the recovered measurements from the estimated state at the time of linearization. In our case $\mathbf{z}_{\text{rel}}=\mathbf{T}_{i}^{-1}\mathbf{T}_{j}\in\mathrm{SE}(3)$ , $\mathbf{z}_{\text{rp}}=\mathbf{R}_{i}\in\mathrm{SO}(3)$ , $\mathbf{z}_{\text{pos}}=\mathbf{p}_{i}\in\mathbb{R}^{3}$ and $\mathbf{z}_{\text{yaw}}=\mathbf{R}_{i}^{-1}\begin{pmatrix}1&0&0\end{pmatrix}^{\top}\in\mathbb{R}^{3}$ .

We recover pairwise relative-pose factors between the keyframe that we will remove and all other current VIO keyframes. For that keyframe we also recover roll-pitch, absolute position and yaw factors (Fig. 6). This gives us a full-rank invertible Jacobian $\mathbf{J}_{\text{r}}$ which means that we can use Eq. (30) for recovering information matrices for the factors.

Since yaw and absolute position are the four unobservable degrees of freedom of the VIO, the only information we have about them comes from the initial prior on the start pose. As we do not need this information for the global map, we drop the yaw and absolute position factors and only take the relative pose and roll-pitch factors for the map optimization. With these factors, the energy terms $E^{\text{G}}_{\text{nfr}}$ become

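$$E^{\text{G}}_{\text{nfr}}(\mathbf{s})=\sum_{(i,j)\in\mathcal{R}}\mathbf{r}_{\text{rel}}(\mathbf{s},\mathbf{z}^{ij}_{\text{rel}})^{\top}\mathbf{H}_{ij}\,\mathbf{r}_{\text{rel}}(\mathbf{s},\mathbf{z}^{ij}_{\text{rel}})+\sum_{i\in\mathcal{P}}\mathbf{r}_{\text{rp}}(\mathbf{s},\mathbf{z}^{i}_{\text{rp}})^{\top}\mathbf{H}_{i}\,\mathbf{r}_{\text{rp}}(\mathbf{s},\mathbf{z}^{i}_{\text{rp}}),\qquad(35)$$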

where $\mathcal{R}$ is the set of all relative pose factors and $\mathcal{P}$ is the set of all roll-pitch factors.
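To illustrate why a roll-pitch factor can be kept while yaw is dropped, the sketch below uses one possible roll-pitch residual: the $x$ and $y$ components of the gravity direction after mapping it through $\mathbf{z}_{\text{rp}}\mathbf{R}_{i}^{-1}$. This residual form, the gravity vector $(0,0,-1)^{\top}$ and all function names are illustrative assumptions and may differ in detail from the definitions used above.

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def roll_pitch_residual(R_i, z_rp):
    # Assumed residual: x, y components of z_rp * R_i^-1 * g with g = (0, 0, -1).
    g = np.array([0.0, 0.0, -1.0])
    return (z_rp @ R_i.T @ g)[:2]

R_lin = rot_z(0.3) @ rot_y(0.2) @ rot_x(-0.1)  # rotation at linearization
z_rp = R_lin                                   # pseudo measurement

r_at_lin = roll_pitch_residual(R_lin, z_rp)                # zero by construction
r_yawed = roll_pitch_residual(rot_z(1.0) @ R_lin, z_rp)    # global yaw change
r_tilted = roll_pitch_residual(rot_x(0.05) @ R_lin, z_rp)  # global tilt change
```

Under these assumptions a rotation about the gravity axis leaves the residual at zero, while a tilt of 0.05 rad produces a residual of roughly that magnitude, so the factor constrains only the gravity-aligned part of the orientation.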

VI Evaluation

To evaluate the presented approach we conduct experiments on the EuRoC dataset [5] and compare it to other state-of-the-art systems. We present the evaluation for both our VIO subsystem and our full visual-inertial mapping approach. Our VIO runs the optimization in a local window of frames and provides a pose for every tracked frame, while the mapping system performs global map optimization for keyframes that were selected by the VIO. To measure the accuracy of the evaluated systems, we use the root mean square (RMS) of the absolute trajectory error (ATE) after aligning the estimates with ground truth.
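The metric can be sketched as follows, assuming a rigid (rotation and translation, no scale) least-squares alignment of time-associated positions via the SVD-based Kabsch method. The alignment type and the function names are assumptions for illustration; the trajectories are synthetic.

```python
import numpy as np

def align_se3(est, gt):
    """Rigid least-squares alignment (R, t) that maps est onto gt (N x 3 arrays)."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    C = (gt - mu_g).T @ (est - mu_e)       # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(C)
    # Guard against reflections so that R is a proper rotation.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt
    t = mu_g - R @ mu_e
    return R, t

def rms_ate(est, gt):
    """RMS of the position error after rigid alignment."""
    R, t = align_se3(est, gt)
    aligned = est @ R.T + t
    return np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1)))

rng = np.random.default_rng(0)
gt = rng.standard_normal((50, 3))          # synthetic ground-truth positions
c, s = np.cos(0.7), np.sin(0.7)
R0 = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
t0 = np.array([1.0, -2.0, 0.5])
est = (gt - t0) @ R0                       # same trajectory in another frame

ate_same_shape = rms_ate(est, gt)
est_bad = est.copy()
est_bad[0] += 1.0                          # one corrupted position
ate_corrupted = rms_ate(est_bad, gt)
```

A trajectory that differs from ground truth only by a rigid transform yields an error near zero, which is why the alignment step removes the unobservable global position and yaw before accuracy is measured.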

System parameters

At the KLT tracking stage the image is divided into a regular grid with a cell size of 50 pixels. For each cell that has no point tracked from the previous frame, one feature point with the best FAST response is extracted (if it exceeds the threshold). With the resolution of the EuRoC dataset this results in 80-120 features tracked by the system at every point in time. At the VIO level we use a window of 7 old keyframes (poses) and 3 latest temporal states (poses, velocities and biases). The newest temporal state is selected as a keyframe if less than 70% of the KLT features are connected to the currently tracked points in the local map.
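The grid bookkeeping and the keyframe criterion can be sketched as below. This is an illustration only: the handling of partial border cells, the behaviour when nothing is tracked, and the function names are assumptions, and the FAST detection itself is omitted.

```python
import numpy as np

def empty_cells(points, width, height, cell=50):
    """Return (row, col) indices of grid cells that contain no tracked point."""
    cols = -(-width // cell)    # ceiling division keeps partial border cells
    rows = -(-height // cell)
    occupied = np.zeros((rows, cols), dtype=bool)
    for x, y in points:
        occupied[int(y) // cell, int(x) // cell] = True
    return [(r, c) for r in range(rows) for c in range(cols) if not occupied[r, c]]

def is_keyframe(num_connected, num_tracked, ratio=0.7):
    """Select a keyframe if fewer than `ratio` of the features are connected to the map."""
    if num_tracked == 0:
        return True             # assumption: nothing tracked forces a keyframe
    return num_connected / num_tracked < ratio

# Two tracked points occupy the top row of a 2 x 2 grid.
cells_to_fill = empty_cells([(10, 10), (60, 10)], width=100, height=100)
```

Each returned cell would receive at most one new FAST corner, which keeps the tracked features spread evenly over the image.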

Accuracy

The results of the evaluation are summarized in Table I. When considering visual-inertial odometry methods our system shows the best performance on eight out of ten sequences while the closest competitor (VI DSO [26]) shows the best results on five.

To evaluate the mapping part we compare it to the visual-inertial version of ORB-SLAM [19], where the vision subsystem is very similar to the one proposed in our mapping layer (ORB keypoints). The main difference lies in the inertial part where ORB-SLAM uses preintegrated measurements between keyframes, while we use recovered non-linear factors that summarize IMU and visual tracking on the VIO layer.

The proposed system clearly outperforms ORB-SLAM on the “machine hall” sequences where the large scale of the environment results in large time intervals between keyframes. On the “Vicon room” sequences the difference is smaller, since the rapid motion of the MAV that carries the camera in a small room results in many keyframes with small time intervals between them.

Qualitative results of reconstructed maps are shown in Fig. 1. With the proposed system we are able to reconstruct globally consistent gravity-aligned maps and recover keyframe poses even for segments where no matches between detected ORB features can be estimated.

Factor Weighting

To evaluate the importance of the extracted factors and their proper weighting in the final mapping results we consider two alternative implementations. In the first one we do not use any factors and rely purely on the BA with ORB features. In the second one we extract the factors, but use identity weights (i.e. $\mathbf{H}_{ij}=\mathbf{H}_{i}=\mathbf{I}$ in Eq. (35)) for all of them, which is a typical approach for pose graph optimization [19, 20]. The evaluation results presented in Table I show that the system with the factor weights recovered according to Sec. V results in better accuracy and robustness when compared to those alternatives.

Timing

The main source of timing improvement for the mapping stage is that the global optimization requires a 2.5 times smaller state (no velocities or biases) compared to naive IMU integration. In absolute numbers, we test our system on an Intel E5-1620 CPU (4 cores, 8 virtual cores). Our implementation is highly parallel and utilizes all available CPU resources. For the VIO the average time per frame on the EuRoC sequences is 7.83 ms (largest: 9.4 ms on MH_02; smallest: 5.5 ms on V1_03). On average 11.5% of the frames are selected as keyframes and proceed to the mapping stage.

The timing of the mapping stage is provided in Table II. In particular, for the MH_05 sequence (see Fig. 1, 2273 stereo frames, 114 seconds) the processing takes 19.2 seconds for VIO and 9.7 seconds for mapping for the entire sequence (around 4x faster than real-time playback).
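The two figures quoted in this subsection can be reproduced with simple arithmetic. The per-keyframe state sizes below (6 for the pose, 3 for the velocity, 3 each for gyroscope and accelerometer bias) are the usual minimal parametrization and are stated here as an assumption.

```python
# Per-keyframe state dimensions (assumed minimal parametrization).
pose_dof = 6
velocity_dof = 3
bias_dof = 3 + 3  # gyroscope and accelerometer biases

# State-size ratio between naive IMU integration and pose-only mapping.
state_ratio = (pose_dof + velocity_dof + bias_dof) / pose_dof

# Real-time factor on MH_05: sequence duration over total processing time.
sequence_seconds = 114.0
vio_seconds = 19.2
mapping_seconds = 9.7
realtime_factor = sequence_seconds / (vio_seconds + mapping_seconds)
```

Under these assumptions `state_ratio` is 2.5 and `realtime_factor` is just under 4, matching the figures in the text.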

VII Conclusions

In this paper we present a novel approach for visual-inertial mapping that combines the strengths of highly accurate visual-inertial odometry with globally consistent keyframe-based bundle adjustment. We achieve this in a hierarchical framework that successively recovers non-linear factors from the VIO estimate that summarize the accumulated inertial and visual information between keyframes. VIO is formulated as fixed-lag smoothing which optimizes a set of active recent frames in a sliding window and keeps past information in marginalization priors. The accumulated VIO information between keyframes is extracted and retained for the visual-inertial mapping when a keyframe falls outside the window and is marginalized.

Compared to alternative approaches that use preintegrated IMU measurements between keyframes our system shows better trajectory estimates on a public benchmark. This formulation has the potential to reduce the computational cost of optimization by reducing the dimensionality of the state space and enable large-scale visual-inertial mapping. Integrating information from other sensor modalities or extending the system for multi-camera settings are interesting directions for future research.

Bibliography

[1] S. Baker and I. Matthews, “Equivalence and efficiency of image alignment algorithms,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Comput. Soc, 2001.
[2] T. Barfoot, State Estimation for Robotics. Cambridge University Press, 2017.
[3] M. Bloesch, M. Burri, S. Omari, M. Hutter, and R. Siegwart, “Iterated extended Kalman filter based visual-inertial odometry using direct photometric feedback,” The International Journal of Robotics Research (IJRR), vol. 36, no. 10, pp. 1053–1072, sep 2017.
[4] M. Bloesch, H. Sommer, T. Laidlow, M. Burri, G. Nützi, P. Fankhauser, D. Bellicoso, C. Gehring, S. Leutenegger, M. Hutter, and R. Siegwart, “A primer on the differential calculus of 3D orientations,” arXiv:1606.05285 [cs.RO], jun 2016.
[5] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, “The EuRoC micro aerial vehicle datasets,” The International Journal of Robotics Research (IJRR), vol. 35, no. 10, pp. 1157–1163, jan 2016.
[6] J. Civera, A. Davison, and J. Montiel, “Inverse depth parametrization for monocular SLAM,” IEEE Transactions on Robotics (TRO), vol. 24, no. 5, pp. 932–945, oct 2008.
[7] E. Eade, “Lie groups for computer vision,” Technical Report, Cambridge University, 2014.
[8] C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, “IMU preintegration on manifold for efficient visual-inertial maximum-a-posteriori estimation,” in Proc. of Robotics: Science and Systems (RSS). Robotics: Science and Systems Foundation, jul 2015.