Stereo relative pose from line and point feature triplets
Alexander Vakhitov, Victor Lempitsky, and Yinqiang Zheng

TL;DR
This paper introduces two new minimal solvers for stereo relative pose estimation using triplets of point or line features with known projections, improving motion estimation in stereo visual odometry.
Contribution
The paper provides a complete classification of minimal cases with three features and introduces two solvers capable of handling all such cases, enhancing stereo visual odometry.
Findings
New solvers improve motion estimation accuracy
Complete classification of minimal cases
Enhanced performance in visual SLAM systems
Abstract
Stereo relative pose problem lies at the core of stereo visual odometry systems that are used in many applications. In this work, we present two minimal solvers for the stereo relative pose. We specifically consider the case when a minimal set consists of three point or line features and each of them has three known projections on two stereo cameras. We validate the importance of this formulation for practical purposes in our experiments with motion estimation. We then present a complete classification of minimal cases with three point or line correspondences each having three projections, and present two new solvers that can handle all such cases. We demonstrate a considerable effect from the integration of the new solvers into a visual SLAM system.
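
Tab. 2: Results of 5 runs of the original and the modified ORB-SLAM2 on KITTI sequences 00-10 with every second frame dropped. F% is the percentage of failed runs.

| Sequence | ORB-SLAM2 F% | ORB-SLAM2 rot. err. | ORB-SLAM2 transl. err. | ORB-SLAM2+EpiSEgo F% | ORB-SLAM2+EpiSEgo rot. err. | ORB-SLAM2+EpiSEgo transl. err. |
|---|---|---|---|---|---|---|
| 00 | 100% | - | - | 0% | | 62% |
| 01 | 40% | | 3% | 40% | | 1.6% |
| 02 | 100% | - | - | 0% | | 53% |
| 03 | 60% | | 0.8% | 0% | | 3% |
| 04 | 40% | | 0.8% | 0% | | 0.8% |
| 05 | 100% | - | - | 0% | | 55% |
| 06 | 100% | - | - | 0% | | 7% |
| 07 | 80% | | 1.3% | 0% | | 7% |
| 08 | 100% | - | - | 0% | | 63% |
| 09 | 100% | - | - | 80% | | 63% |
| 10 | 100% | - | - | 0% | | 36% |
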
| Method | Time, ms |
|---|---|
| EpiSEgo | 2 |
| Approx | 8 |
| Pradeep | 0.1 |
| P3P | 0.05 |

| Method | 00 | 06 |
|---|---|---|
| EpiSEgo | 57.54 | 10.53 |
| Approx | 564.80 | 39.18 |
| Pradeep | 76.30 | 891.81 |
| P3P | - | 51457 |

Alexander Vakhitov (1), Victor Lempitsky (1), Yinqiang Zheng (2)
(1) Skoltech, Moscow, Nobelya Ulitsa 3, 121207, Russia. Email: {a.vakhitov, lempitsky}@skoltech.ru
(2) NII, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan. Email: [email protected]
Keywords:
minimal solver, stereo visual odometry, generalized camera, relative pose, line features
The work is funded by the Russian MES grant RFMEFI61516X0003; a part of this work was finished when Alexander Vakhitov was visiting the National Institute of Informatics (NII), Japan, funded by the NII MOU/Non-MOU International Exchange Program.
1 Introduction
Minimal solvers in computer vision are used to generate camera motion hypotheses from required minimal sets of feature correspondences, e.g. five feature point correspondences for single camera relative pose estimation [1]. Such solvers are mostly used as a source of motion hypotheses inside a RANSAC loop [2]. They are useful in providing initialization for the optimization procedures at the core of state-of-the-art SLAM systems [3]. For many pose estimation problems, such solvers have already been developed and are extensively used, e.g. to create large-scale structure from motion reconstructions involving thousands of images [4]. It is important to develop minimal solvers taking line segment correspondences as input in addition to points. As recent works demonstrated [5, 6], the use of line segment features can considerably improve accuracy and robustness of visual SLAM and structure from motion systems.
To the best of our knowledge, there is no minimal solver for stereo camera relative pose estimation which is efficient enough for real-time use and does not rely on simplifying assumptions limiting its applicability. Specifically, [7] is too computationally heavy for real-time use, [8] is non-minimal and [9] employs an approximate rotation model that is valid only for small rotations. In this work, we describe two solvers that aim to close this gap, giving an efficient minimal solution to the stereo camera egomotion from three feature triplets. We assume that there are two stereo cameras with projection matrices for the first camera, and for the second one, where the baseline is known. The goal of the solvers is to find and . In each case, we use three feature triplets, where each triplet is a set of three projections of a 3D line or a 3D point computed using ,
(Footnote 1) In the presence of two-view correspondences only, the overlapped stereo can be regarded as a non-overlapped stereo, and some solutions have been proposed as in [10, 9, 7]. We exclude this case from consideration because the major focus of this work is on overlapped stereo systems.
While the ability to use features with three rather than four known projections may seem unnecessary for a stereo system, we show that it actually provides considerable benefits. To illustrate this, we ran a motivating experiment using the first sequence of the KITTI Odometry dataset [11]. We detected ORB [12] and LBD [13] features and matched them between neighboring frames and across stereo views. We then used the provided ground truth poses to estimate the ratio of inlier matches. We observe that for ORB matches the ratio of inlier matches across triplets of views is greater than that across quadruplets ( vs ). For LBD line matches, the advantage is even greater ( vs ). The advantage of relying on triplet matches is further corroborated in our experiments. We develop two solvers covering any combination of the point/line correspondences among the two view pairs. The first solver delivers 16 solutions, which is equal to the degree of the corresponding algebraic variety, so it is impossible to obtain a solver for the formulated equations with a smaller number of solutions. The second solver outputs 32 solutions but is computationally simpler. Both are novel: to the best of our knowledge, no prior work describes a solution to the stereo camera relative pose problem for any combination of line/point features with three projections, or even for point features only.
Experiments show that our solvers are numerically stable and computationally efficient. More interestingly, by using point and line features simultaneously, our solvers work reliably for real scenarios. The use of three-view correspondences allows increasing the inlier cardinality and ratio, which not only facilitates the RANSAC procedure, but also reduces the risk of drifting in the case of long trajectories.
To summarize, we make the following contributions. Firstly, we systematically explore the stereo ego-motion estimation problem in the case of a minimal set of three point and line features with three correspondences. Secondly, we develop new minimal solvers, which output a minimal number of solutions, and demonstrate the increase in accuracy and robustness of stereo egomotion estimation on simulated and real data.
In Section 2, we review the most closely related works on ego-motion estimation. In Section 3, we present the problem formulation, the complete categorization of the minimal point and line sets in any three views, and the proposed solvers. We present the experimental results in Section 4.
2 Related work
Non-overlapping fields of view:
To increase the coverage of the field of view (FoV) and to decrease the costs as much as possible, it has become popular in recent years to use multiple cameras without overlapping FoVs. The generalized relative pose method proposed in [7] can be applied to estimate the relative pose of such multi-camera systems; however, it returns up to 64 solutions and is too computationally expensive for real-time use. To solve the problem in real time, some authors introduce approximations. For example, Kneip and Li [10] proposed to use non-minimal point sets and developed an approximate iterative optimization method, whose running speed is inappropriate for real-time applications. For acceleration, Ventura et al. [9] linearized the rotation between two consecutive time frames, so the solver does not apply in the general visual odometry setting.
Overlapping fields of view:
Binocular stereo systems with partially overlapping FoVs are preferable in terms of system calibration and metric reconstruction. To estimate the ego-motion of an overlapping stereo rig, Nister et al. [14] proposed to use three points or two lines matched across all four views via triangulation. Chandraker et al. [15] showed that the triangulation of four-view correspondences for ego-motion estimation is unstable, especially when the baseline is small. They proposed instead to use three four-view line correspondences. Pradeep and Lim [8] used assorted point and line features and developed several minimal solvers for any point and line combinations, as long as these features are simultaneously visible in all four views. Clipp et al. [16] used point features in a mixed number of views, and Dunn et al. [17] used similar input data and accelerated the solving speed by using the constraints in proper ways. Discarding the correspondences without projections onto both views of one stereo camera, one can use generalized absolute pose solvers [18, 19, 20, 21, 22, 23]. To summarize, no prior work addresses the stereo relative pose problem for three features with three projections. Most of the studies consider only four-view correspondences of point features.
3 Stereo egomotion solvers
We address the problem of feature-based relative pose estimation for the binocular stereo camera, assuming that each line or point has exactly three projections. The minimal set in this case consists of three features. Trifocal tensors provide a way to formulate constraints for the line and point features arising from three perspective views. Using the translation and rotation parameterizations (1), (2) which are explained below, these trifocal constraints become third-order equations. While for each line feature there are two such equations, for every point feature nine equations are obtained [24], of which only two are linearly independent. This effect complicates the solver construction.
At the same time, in our problem formulation, for each feature there always exists a stereo camera such that the feature is projected onto both of its views (the main camera). This simplifies the problem and allows us to use projection constraints or two-view epipolar constraints between each view of the main camera and the view of the other camera. We use these approaches below and show that we can obtain 16 or 8 solutions using the proposed solvers, compared to 64 solutions using the solver [7] for the same problem.
3.1 Problem
We assume that there are two binocular rectified and calibrated stereo cameras with the same known baseline. We are given a set of triplet feature correspondences. Each correspondence is a triplet. For point feature, a triplet is , where denotes a homogeneous vector of point projection’s coordinates onto a view of a camera . For a line feature, a triplet is where denotes a vector of 2D line’s coefficients of a 3D line’s projection onto a view of a camera .
W.l.o.g., we assume that the baseline has unit length () and the projection matrices for a view of a camera are , and . Our goal is to find .
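The rectified unit-baseline setup can be illustrated with a minimal sketch. It assumes a convention that is not taken from the paper: normalized image coordinates, the first view at the origin of the stereo camera frame, and the second view shifted by the baseline along the x axis. The function names `project_stereo` and `triangulate_rectified` are hypothetical.

```python
import numpy as np

def project_stereo(X, baseline=1.0):
    """Project a 3D point into both views of a rectified stereo camera.

    Assumed convention (not from the paper): normalized image coordinates,
    view 1 at the origin, view 2 shifted by `baseline` along the x axis.
    """
    X = np.asarray(X, dtype=float)
    left = np.array([X[0] / X[2], X[1] / X[2], 1.0])
    right = np.array([(X[0] - baseline) / X[2], X[1] / X[2], 1.0])
    return left, right

def triangulate_rectified(left, right, baseline=1.0):
    """Recover the 3D point from its two projections via the disparity."""
    disparity = left[0] - right[0]      # equals baseline / depth
    depth = baseline / disparity
    return depth * np.array([left[0], left[1], 1.0])

X_true = np.array([0.5, -0.2, 4.0])
left, right = project_stereo(X_true)
X_rec = triangulate_rectified(left, right)
```

Under these assumptions a feature seen in both views of its main camera can be lifted to 3D before any motion is estimated, which is the step the later sections rely on when they speak of triangulation in the main camera's coordinates.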
3.2 Analysis of feature combinations
As long as there are exactly three projections for each feature, we use the following definition.
Definition: If a feature is projected onto both views of some stereo camera, this camera is called the main camera for this feature.
We use the following notation for feature/correspondence combinations. We refer to problem as when the first camera is the main for points and lines, while the second one is the main for points and lines. To simplify the analysis, for those combinations having points we assume that the first camera is the main for at least one point feature. Some combinations are reducible to other ones by swapping the first and second cameras.
The categorization of the possible feature combinations is summarized in Tab. 1. For a homogeneous minimal set, there are two possible feature divisions between the cameras: S2L-1L and S3L for lines, or S2P-1P and S3P for points. If we have two points and one line, we can get only S2P1L, S2P-1L, S1P1L-1P cases. For one point and two lines, there are S1P2L, S1P1L-1L and S2L-1P cases. No other feature/correspondence combinations are possible.
If all the features have the same main camera (i.e. S3L, S3P, S2P1L, S1P2L), they can be triangulated in the coordinate frame of this camera, and the problem reduces to the generalized absolute pose problem [20] for lines and points, which is known to have 8 possible solutions. If a minimal set consists only of lines (S3L and S2L-1L), it admits a particularly straightforward solution scheme (“easy” cases).
The other situations (S2P-1L, S1P1L-1P, S1P-2L, S1P1L-1L, S2P-1P) are the “hard” cases. They share two common properties: the features have different main cameras and there is at least one point in the feature set. Minimal solvers for them are the main contributions of the paper.
In the next section, we propose two polynomial solver-based approaches for the “hard” cases. After that, we show how the other cases can be reduced to finding the roots of a single eighth-degree polynomial, so that a recently proposed method [23] can be used. For the degeneracy analysis, see Supp. Mat.
3.3 “Hard” cases
In this section, we consider the situation when the features have different main cameras and there is at least one point in the minimal set. Without loss of generality, a camera is the first one if the first (and maybe the only) point feature is projected onto both views of this camera. We also assume that it is projected onto the first view of the second camera. We use the first point to express the translation in terms of the point’s depth and rotation matrix elements, as in [7]. In particular, from an equation describing the point’s projection onto the first view of the second camera we get
[TABLE]
where is the point’s position triangulated in its main camera’s coordinates, is the homogeneous vector of the point’s projection, is the depth constant. We will denote as the translation of the view w.r.t. the stereo camera coordinate system, where iff , else . We use the unit quaternion-based rotation parameterization:
[TABLE]
[TABLE]
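As an illustration of a unit quaternion-based rotation parameterization, the sketch below builds a rotation matrix whose entries are second-degree polynomials in the quaternion components. The component names `a, b, c, d` and the function name are assumptions made for this example; the paper's own symbols are not reproduced here.

```python
import numpy as np

def rotation_from_quaternion(q):
    """Matrix with entries that are homogeneous quadratics in q = (a, b, c, d).

    It is a rotation matrix only when a^2 + b^2 + c^2 + d^2 = 1.
    """
    a, b, c, d = q
    return np.array([
        [a*a + b*b - c*c - d*d, 2*(b*c - a*d),         2*(b*d + a*c)],
        [2*(b*c + a*d),         a*a - b*b + c*c - d*d, 2*(c*d - a*b)],
        [2*(b*d - a*c),         2*(c*d + a*b),         a*a - b*b - c*c + d*d],
    ])

q = np.array([0.9, 0.1, -0.3, 0.2])
q_unit = q / np.linalg.norm(q)
R = rotation_from_quaternion(q_unit)

# Dividing the quaternion by its first component leaves three unknowns.
reduced = q / q[0]                      # (1, b/a, c/a, d/a)
R_scaled = rotation_from_quaternion(reduced)
scale = np.dot(reduced, reduced)        # R_scaled equals scale * R
```

Because every entry has degree two, rescaling the quaternion only rescales the matrix. This is the property that makes it possible to divide homogeneous constraints by one squared component and work with three ratios instead of four unknowns.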
We have experimented with two ways of formulation of the polynomial equations for the stereo egomotion problem explained in the following paragraphs.
3.3.1 Solver based on Epipolar/Pluecker constraints.
We describe next a solver for the ’hard’ cases which uses generalized epipolar constraints as in [7]. If the feature is a point, we analyze the epipolar constraint arising from its projection onto the view of the first camera and onto the view of the second camera. The epipolar line has the equation in homogeneous coordinates using the essential matrix where is a matrix of a cross product with a vector . Then, the point’s projection lies on the epipolar line, which translates to the following constraint:
[TABLE]
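A small numerical check of a two-view epipolar constraint may help here. It is a sketch under an assumed convention (a point moves between frames as X2 = R X1 + t, projections are in normalized coordinates) and it ignores the per-view baseline offsets of the stereo rig; `skew` and `essential_matrix` are names chosen for the example.

```python
import numpy as np

def skew(v):
    """Matrix of the cross product with v: skew(v) @ w == np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def essential_matrix(R, t):
    """Essential matrix for the motion X2 = R @ X1 + t (assumed convention)."""
    return skew(t) @ R

# Synthetic motion: rotation about the y axis plus a translation.
angle = np.deg2rad(10.0)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([0.3, -0.1, 0.05])

X1 = np.array([0.4, 0.2, 5.0])          # point in the first frame
X2 = R @ X1 + t                         # same point in the second frame
x1 = X1 / X1[2]
x2 = X2 / X2[2]

epipolar_line = essential_matrix(R, t) @ x1
residual = float(x2 @ epipolar_line)    # zero up to rounding for a true match
```

The residual vanishes because the second projection lies on the epipolar line of the first one, which is the geometric fact the constraint encodes.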
For the point feature, we will get two constraints of the form (4) with the unknowns and . Using 3D line’s projections onto the views of its main camera we compute a pair of 3D points lying on the line . Assuming that , we get the following expression for the line through projections of the points and :
[TABLE]
where is a scaling parameter. It leads to the following constraint:
[TABLE]
Likewise, we obtain the following constraint for :
[TABLE]
A system of the constraints (4), (6) or (7) can be formulated as
[TABLE]
where is a vectorized matrix , and and are coefficient matrices.
Substituting the parameterization (2) into (8), we get four equations of degree three w.r.t. and add to them the constraint (3). After formulating these equations over , we find using Maple [25] that the dimension of the quotient ring for the polynomial ideal is 32, see [26] for details. Each term in the equations (8) after substitution of (2) is of degree 2 w.r.t. . We divide the equations by , and denote , . We choose as a divisor because it is close to one if the rotation is not big, which is the typical case for the SLAM systems. Finally, we get the constraints in the vector form:
[TABLE]
where is a matrix consisting of second-degree polynomials. All the sub-matrices of must have zero determinants. It gives six equations of degree four, which we multiply with all the monomials of of degree three and obtain 240 equations and then use them to construct an elimination template.
After the LU-decomposition of the template matrix, using the action monomial to construct an action matrix, we obtain the solutions by eigen-decomposition, find from the null-space of , find using the unit-norm constraint (3) and using (1).
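The action matrix idea can be shown on a toy univariate problem, which is far simpler than the multivariate templates described above and is not the paper's solver. In the quotient ring of polynomials modulo a monic polynomial p, multiplication by x is a linear map, and the eigenvalues of its matrix are the roots of p. The function name below is hypothetical.

```python
import numpy as np

def action_matrix_univariate(coeffs):
    """Matrix of multiplication by x in R[x]/(p), basis (1, x, ..., x^(n-1)).

    `coeffs` holds [c0, ..., c_(n-1)] of the monic polynomial
    p(x) = x^n + c_(n-1) x^(n-1) + ... + c0.
    """
    n = len(coeffs)
    M = np.zeros((n, n))
    M[1:, :-1] = np.eye(n - 1)          # x * x^k = x^(k+1) for k < n-1
    M[:, -1] = -np.asarray(coeffs)      # x * x^(n-1) reduced modulo p
    return M

# p(x) = (x - 1)(x - 2)(x + 3) = x^3 - 7x + 6
M = action_matrix_univariate([6.0, -7.0, 0.0])
roots = np.sort(np.linalg.eigvals(M).real)
```

In the multivariate case the elimination template plays the role of the reduction modulo p: it expresses the action monomial times each basis monomial in terms of the basis, and the eigen-decomposition then yields the solutions.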
3.3.2 Solver based on point projection constraints.
For this solver, we apply the known preprocessing rotation to the projections of all the features to the views of the second camera. is chosen so that the first point’s projection is in the image center: , see (1) for the definition of . The baseline vectors of the cameras become different; we denote them as , where is the stereo camera index, and get and .
We define a function describing the point projection process, which takes a 3D point expressed in the first camera’s coordinate frame and outputs the homogeneous point projection coordinates to view of the second camera:
[TABLE]
which is a standard point projection equation after we substitute the translation according to (1). By noting that the rotation from the second to the first camera is and the translation is , using (1), we formulate a similar function returning a projection of a 3D point expressed in the second camera’s coordinate frame to a view of a first camera:
[TABLE]
We assume that the camera is the main one for the feature, and that the feature also has a projection onto a view of a camera . The constraint for the point feature is obtained from by expressing and substituting the depth parameter :
[TABLE]
where is the coordinate index of the feature projection, and is found by triangulation using the point’s projections onto the main camera views. The constraint for the line feature is:
[TABLE]
Using these constraints and substituting the parameterization (2), we get a system of four equations:
[TABLE]
where is a matrix of second-degree polynomials.
Generating in the systems for all the possible feature combinations together with a constraint (3) and using Maple [25] we find that the quotient ring dimension and the number of solutions is 16.
From the system (14) by subtracting equations we obtain one or two (S2L-1P) linearly independent second-degree equations free of . As before, by computing determinants we get fourth-degree equations. The final system consists of six fourth degree equations (or five for SP-2L, because one of the determinants is identically zero), one (or two, for SP-2L) -free second-degree equations, and a quadratic constraint (3). This system also leads to 16 solutions.
The basis of the remainder quotient ring as a vector space is not the same for different feature combinations. In particular, for the S1P1L-1P and S1P1L-1L cases there is one particular basis, and another one for the combinations S2P-1L, S2P-1P, S2L-1P (see Supp.Mat.).
We solve the obtained system by constructing an elimination template. Denote the second degree equation obtained after subtraction as , the unit norm constraint as , the other equations as , . We form an equation set from multiplied with , multiplied with , and for . We multiply every equation from by , then by , then by , then by , and add all the equations obtained after every multiplication operation to a set of cardinality 975. It allows us to express all the basis monomials times the action variable . We use LU decomposition and get the action matrix of size . It is four times smaller than in the case of the Epipolar/Pluecker constraints, so the eigendecomposition can be performed faster, but template construction and LU decomposition will be slower.
3.4 Only line features
The previous analysis is missing two situations: when all the features are lines or when they all have the same main camera. Next, we describe how both situations lead to a second-degree polynomial system in three unknowns, and therefore can be addressed by an already developed method for this type of geometric computer vision problems.
The coefficients of the 3D line’s projection coincide with the direction of the normal to the plane through the camera center and the 3D line. If we observe the line from three views, we know normals of three different planes containing the line: , and their triple product is equal to zero: . We are going to use this fact to formulate the constraints as follows:
[TABLE]
for or depending on whether the main camera for the feature is the first or the second one, is the view number. Three such constraints formulated using (2) result in a second-degree system with 4 unknowns, the fourth equation being the unit norm constraint.
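The vanishing triple product can be verified numerically. The sketch below assumes that all plane normals are already expressed in one common frame (in the solver, this is where the unknown rotation enters); the helper name `plane_normal` and the sample coordinates are made up for the example.

```python
import numpy as np

def plane_normal(center, P, Q):
    """Normal of the plane through a camera center and the 3D line PQ."""
    return np.cross(P - center, Q - center)

# Two points on a 3D line and three camera centers in one common frame.
P = np.array([1.0, 0.5, 6.0])
Q = np.array([-0.5, 1.0, 7.5])
centers = [np.array([0.0, 0.0, 0.0]),
           np.array([1.0, 0.0, 0.0]),
           np.array([0.4, -0.2, 0.3])]

n1, n2, n3 = (plane_normal(c, P, Q) for c in centers)
triple_product = float(np.dot(n1, np.cross(n2, n3)))
```

Each normal is perpendicular to the line direction Q - P, so the three normals lie in one plane and their triple product is zero up to rounding.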
3.5 Single main camera
If all the features have the same main camera, their 3D coordinates w.r.t. this camera can be computed, and then the problem becomes a particular case of the generalized absolute pose problem (gP3P). For three point features, several methods are available, e.g. [23]. To the best of our knowledge, there are no papers analyzing the generalized absolute pose problem for mixed point/line minimal sets.
We propose to use the earlier introduced constraints (12,13) here as well. Without loss of generality, we assume that the first camera is the main one for all the features, so we use the constraints with for . The depth enters the system linearly. It can be expressed as a linear combination of terms involving other unknowns. This way we obtain a system of three equations w.r.t the unknown rotation matrix parameters.
In both cases, we transform a system with four unknowns into a system of three quadrics in three unknowns by the use of the constraint (3) to remove the zero-order terms and division by .
The recent work [23] provides a framework that fits our problem well: it reduces the intersection of three quadrics to root-finding of a single eighth-degree polynomial constructed with the hidden variable method. We customize the method by adaptively choosing the variable to hide using the condition numbers (see Supp. Mat.).
To sum up, we have considered all possible feature and correspondence combinations up to symmetries. In the cases for three line features or when all features share the same main camera, the method [23] can be applied. The remaining cases are the most difficult and we have proposed two new polynomial solvers to cope with them. Next, we demonstrate the benefits of using the proposed solvers in synthetic experiments and on real data.
4 Experiments
4.1 Simulated data
4.1.1 Setup:
We perform a number of synthetic experiments to evaluate our method against the non-minimal assorted features solver [8] (Pradeep). For point-only configurations, we also compare to an approximate minimal solver for generalized relative pose from points for small rotation [9] (Approx). Finally, we evaluate bundle adjustment (BA) initialized with the true pose, which minimizes the "gold standard" geometric feature reprojection error. We use BA with oracle initialization as a reference to demonstrate the best realistically achievable accuracy. While our methods and Approx use minimal feature sets, Pradeep needs four projections for every feature. We have re-implemented Pradeep and used the original code of Approx. We evaluate both of the proposed solvers, namely the epipolar constraint-based solver (EpiSEgo) and the point projection constraint-based solver (PPSEgo).
We assume that the stereo camera is rectified and the baseline is . We also consider square images with the side of 1000 pixels and the vertical and horizontal view angle of 90°. We fix the first stereo camera at the origin and randomly place the second camera. The points, as well as the 3D endpoints of the lines, are uniformly sampled from the box . The distance between stereo cameras is sampled uniformly from the interval . The second camera is rotated with angle uniformly sampled from the interval and around a random axis direction. If fewer than seven vertices of are visible, the pose is resampled. We add Gaussian noise with pixels to the projections of the points and to line segments’ endpoints. The lines have the length sampled uniformly from ; the line generating process follows [27]. Each experiment consists of 1000 random simulations for each of the possible feature/correspondence combinations.
4.1.2 Results.
We compute the median absolute rotation error (in degrees) and relative translation error (in %) for three overlapping sets of feature/correspondence combinations: ’hard’ cases, ’easy’ cases (i.e. three line features or features sharing the same main camera), and the point-only cases. To check numerical stability, we use zero additive noise and get the median (mean) rotation errors of () degrees for PPSEgo and () degrees for EpiSEgo, which is comparable to errors reported for similar solvers [7]. PPSEgo is thus more numerically stable.
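The two error measures can be sketched as follows. The exact definitions used in the experiments are not spelled out in the text, so this is one common choice and an assumption: the rotation error is the angle of the residual rotation, and the translation error is the norm of the difference relative to the true translation length. All function names are hypothetical.

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Angle (degrees) of the residual rotation R_est^T R_gt."""
    cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

def translation_error_percent(t_est, t_gt):
    """Translation error in percent of the true translation length."""
    return float(100.0 * np.linalg.norm(t_est - t_gt) / np.linalg.norm(t_gt))

def rot_z(deg):
    """Rotation about the z axis, used only to build the example."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a), np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])

rot_err = rotation_error_deg(rot_z(12.0), rot_z(10.0))
trans_err = translation_error_percent(np.array([1.02, 0.0, 0.0]),
                                      np.array([1.0, 0.0, 0.0]))
```

With a relative measure, the same absolute translation error counts for less as the true translation grows, which is worth keeping in mind when reading the experiment that varies the translation magnitude.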
The next experiment (Fig. 2) shows that the difference diminishes when the noise is present. Here we vary from to . The errors for all methods increase with the noise level, and the accuracy of the proposed solvers is close to the reference one of the BA and better than the one of Pradeep and Approx. The translation errors tend to be higher if more line features are in the minimal set.
Next, we vary the rotation magnitude from 0 to 45 degrees (Fig. 3). The rotation accuracy for the SEgo methods approaches BA accuracy, while translation errors are bigger. The accuracy of both rotation and translation of Approx drops rapidly due to the use of the small-angle rotation approximation.
We then vary the translation magnitude from 1 to 33 (Fig. 3, right). Due to the choice of relative error to measure translation, we observe that translation accuracy increases, while the rotation errors grow and then stabilize. PPSEgo has slightly better translation and slightly worse rotation accuracy than EpiSEgo. Again, the accuracy of the new solvers approaches the reference (BA) and outperforms the baselines Pradeep and Approx.
4.2 Real experiments
4.2.1 Matching between frames.
We use the processed and rectified grayscale stereo sequences of the KITTI dataset as input [11]. Given four views, we detect and match lines and points using the EDLines + LBD [28, 13] and ORB [12] algorithms implemented in OpenCV.
We evaluate one of our solvers, EpiSEgo, against the baselines Pradeep [8], Approx [9] and P3P [29]. The Pradeep method takes as input four-view point and line correspondences. The P3P method emulates the classical approach of visual SLAM and takes the three-view correspondences constructed from both views of the first camera and the first view of the second camera. The Approx and our EpiSEgo methods work with three-view correspondences. While Approx uses only point features, our method employs both types of features.
To match point features between two images and , for each feature from we find the closest one in by the descriptor distance. We reject the match if its reprojection error after triangulation is greater than pixels. We match line segments in the same way, but without the reprojection validation. To find four-view correspondences, we match left and right images in both stereo pairs, and then match left images of the first and the second pairs. Denote the first stereo pair images as , and the second stereo pair images as . We find three-view correspondences for each possible triplet of the four images.
Then we run the classical RANSAC loop [2] with the threshold of pixels, and the initial outlier ratio of . For the three-view correspondences, we triangulate a feature using its main camera, project it onto the remaining view and compare the reprojection error to . For the four-view correspondences, we choose one stereo pair, triangulate the feature using the projections onto its views, and then test the reprojection errors onto the views of the other stereo pair.
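To see why the inlier ratio matters for the RANSAC loop, the sketch below uses the standard formula for the number of samples needed to draw at least one all-inlier minimal set with a given confidence. The confidence value of 0.99 and the function name are assumptions for illustration; the sample size of three matches the minimal sets used here.

```python
import math

def ransac_iterations(outlier_ratio, sample_size=3, confidence=0.99):
    """Samples needed to draw one all-inlier minimal set with `confidence`."""
    inlier_prob = (1.0 - outlier_ratio) ** sample_size
    if inlier_prob >= 1.0:
        return 1
    if inlier_prob <= 0.0:
        return math.inf
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - inlier_prob))

iterations_half = ransac_iterations(0.5)    # half of the matches are outliers
iterations_third = ransac_iterations(0.3)   # fewer outliers, fewer samples
```

A higher inlier ratio among the correspondences, such as the one observed for three-view matches in the motivating experiment, directly lowers the number of hypotheses the loop has to test.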
In Fig. 4 we show the results of the motion estimation experiment with the consecutive frame pairs of KITTI [30] sequence 6. The use of three-view correspondences and point and line features helps the EpiSEgo to achieve high inlier ratios and lower pose estimation errors compared to the baselines.
4.2.2 Integration into visual SLAM pipeline.
While the previous experiment compares stereo egomotion methods at the task of relative pose estimation between stereo pairs, we also validate that such a solver can be used to improve modern stereo visual odometry pipelines. For this, we evaluate a system that integrates the proposed EpiSEgo solver into the ORB-SLAM2 [31] pipeline.
The ORB-SLAM2 pipeline uses the previous frame pose as an initial guess to estimate the next frame pose within bundle adjustment. We modify it to run the EpiSEgo solver (point-only version) inside the RANSAC loop. The pose and the inliers estimated by RANSAC are used to initialize bundle adjustment. We run this algorithm each time the standard system loses the track. We do not include line features as they are absent in the original system.
While ORB-SLAM2 works well for the original sequences, it is important to study the robustness of the pipeline to a framerate decrease (which is equivalent to faster observer motion), as this can happen in a real system. To do that, we drop every second frame of the sequence. Note that the uniform frame drop still enables the use of velocity-based pose prediction on which ORB-SLAM2 relies, provided the frames are separated by equal time periods. At the same time, it shows what can happen if motions become less predictable. Our experiments show that ORB-SLAM2 often becomes unable to recover and loses the track, while the use of the EpiSEgo solver can enable successful recovery from tracking losses. In Tab. 2, we show the results of 5 runs for the original and modified ORB-SLAM2 on KITTI sequences 00-10. We report the percentage of failures as well as relative rotation and translation errors proposed by the dataset authors. The modified version does not lose track with probability more than 50% for all the sequences except the 9th, where a lack of tracked features at one moment is a possible problem. The original version is able to track with probability greater than 50% for only one sequence out of 11. The experiment shows that the integration of the stereo egomotion solver considerably increases the robustness of the system.
5 Summary
In this paper, we have proposed new minimal solvers that handle the stereo relative pose problem for any combination of point and line three-view correspondences. This case has not been addressed in the previous literature. We demonstrate that the problem is of practical importance and that the new solvers improve the performance of a well-known SLAM system.
1.1 Method details
1.1.1 Polynomial solvers for the ’hard’ cases (sect. 3.3)
Epipolar/Pluecker constraints
The EpiSEgo solver is based on the Epipolar/Pluecker constraints (5), (7-8). The construction of the polynomial system is described in sect. 3.3. It involves three variables. The polynomial system has 32 solutions. The quotient ring defined by the polynomial ideal has the following basis for all the feature/correspondence combinations:
[TABLE]
[TABLE]
[TABLE]
Point projection constraints
The PPSEgo solver is based on the point projection constraints (13-14). The polynomial system is constructed as described in sect. 3.3. As verified with Maple [25], it has 16 solutions. The basis of the quotient ring defined by the polynomial ideal depends on the particular feature/correspondence combination. For the S1P1L-1P and S1P1L-1L cases it is
[TABLE]
while for the remaining ’hard’ cases S2P-1L, S2L-1P and S2P-1P it is
[TABLE]
1.1.2 Condition number check for quadric intersection (sect. 3.4, 3.5)
To solve the easy cases (sect. 3.4, 3.5) we use the technique described in [23]. We have a system with three unknowns. We try hiding each of the variables in turn and compute the condition number of the resulting system matrix (see formula (3) in the cited paper). We choose the matrix whose condition number is closest to one in absolute value and hide the corresponding variable.
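The selection rule can be sketched as follows. The matrices below are hypothetical placeholders rather than the system matrices of [23], and the function name is an assumption; the sketch only shows the choice of the hidden variable by condition number.

```python
import numpy as np

def pick_hidden_variable(candidate_matrices):
    """Return the variable whose system matrix has the condition number closest to one.

    candidate_matrices maps a variable name to the matrix obtained when that
    variable is hidden (hypothetical inputs for this sketch).
    """
    conds = {name: float(np.linalg.cond(M)) for name, M in candidate_matrices.items()}
    best = min(conds, key=lambda name: abs(conds[name] - 1.0))
    return best, conds

# Placeholder matrices standing in for the three hidden-variable formulations.
candidates = {
    "x": np.diag([1.0, 1e-6, 1.0]),            # badly conditioned
    "y": np.diag([2.0, 1.0, 1.5]),             # well conditioned
    "z": np.array([[1.0, 1.0], [1.0, 1.0001]]),  # nearly singular
}
best_variable, condition_numbers = pick_hidden_variable(candidates)
```

Since the 2-norm condition number is never smaller than one, choosing the value closest to one amounts to choosing the best-conditioned formulation.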
Both PPSEgo and EpiSEgo use quadric intersection for the easy cases. In the synthetic experiments we use the condition number check in EpiSEgo but omit it in PPSEgo. We do so only to demonstrate the importance of this additional step in some situations. In the real experiments, EpiSEgo uses the condition number check.
1.1.3 Degenerate configurations
Ambiguous configurations differ in 3D geometry while having identical feature projections. For such configurations, the constraints become degenerate and the pose cannot be uniquely reconstructed. Some 3D feature configurations are ambiguous regardless of how the cameras are placed, such as three points on a line for the P3P problem, three parallel lines for the P3L problem, or features belonging to the same 3D line for the P1P2L and P2P1L problems. These configurations obviously remain ambiguous for all the problems considered in the paper. The analysis of all possible ambiguities and degeneracies over all combinations of feature and camera configurations is left for future work.
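Two of the camera-independent degeneracies listed above are easy to test for before calling a solver. The following sketch is an illustration only; the function names and tolerances are assumptions and are not part of the released code.

```python
import numpy as np

def points_collinear(p0, p1, p2, tol=1e-9):
    """True if three 3D points lie on one line (relative tolerance on the cross product)."""
    a = np.asarray(p1, float) - np.asarray(p0, float)
    b = np.asarray(p2, float) - np.asarray(p0, float)
    return bool(np.linalg.norm(np.cross(a, b)) <= tol * np.linalg.norm(a) * np.linalg.norm(b))

def lines_parallel(directions, tol=1e-9):
    """True if all 3D line directions are parallel (sign of the direction is ignored)."""
    d = np.asarray(directions, float)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    return all(bool(np.linalg.norm(np.cross(d[0], di)) <= tol) for di in d[1:])

# Three points on a line (degenerate for P3P) versus a generic triangle.
degenerate_points = points_collinear([0, 0, 0], [1, 1, 1], [2, 2, 2])
generic_points = points_collinear([0, 0, 0], [1, 0, 0], [0, 1, 0])

# Three parallel lines (degenerate for P3L) versus lines with distinct directions.
degenerate_lines = lines_parallel([[1, 0, 0], [2, 0, 0], [-3, 0, 0]])
generic_lines = lines_parallel([[1, 0, 0], [0, 1, 0], [1, 0, 0]])
```

With noisy data an exact test is rarely triggered, so in practice the tolerances would have to be chosen relative to the noise level.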
1.2 Real experiments
Matching between frames.
We give here an additional illustration for the experiment with KITTI sequence 6 presented in the paper. We also show the same experiment with KITTI sequence 0. We reconstruct the motion trajectory by integrating the relative poses. In addition to the metrics presented in the main paper, we report the root mean-squared error (RMSE) w.r.t. the ground truth trajectory in Tab. 2. The trajectory reconstructions for sequences 6 and 0 are shown in Fig. 1, left and right respectively. The P3P method often fails to produce a solution, so its trajectory is not shown. The cumulative distributions of the inlier share and of the angular rotation error in degrees for KITTI sequence 0 are shown in Fig. 2. The Approx method gives low pose estimation errors compared to the other baselines, but it results in large trajectory reconstruction errors due to the rotation approximation it employs. The proposed solver EpiSEgo gives the trajectory closest to the ground truth.
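The integration of relative poses and the trajectory RMSE can be sketched as below. The pose convention (each relative pose is a 4x4 matrix giving frame k+1 in the coordinates of frame k) and the RMSE definition (positions only, no trajectory alignment) are assumptions made for this example and may differ from the evaluation code behind the reported numbers.

```python
import numpy as np

def integrate_relative_poses(relative_poses):
    """Chain 4x4 relative poses into absolute poses, starting from the identity."""
    poses = [np.eye(4)]
    for T_rel in relative_poses:
        poses.append(poses[-1] @ T_rel)
    return poses

def trajectory_rmse(estimated, ground_truth):
    """Root mean-squared distance between corresponding camera positions."""
    est = np.array([T[:3, 3] for T in estimated])
    gt = np.array([T[:3, 3] for T in ground_truth])
    return float(np.sqrt(np.mean(np.sum((est - gt) ** 2, axis=1))))

# Hypothetical example: three unit steps along x.
step = np.eye(4)
step[0, 3] = 1.0
estimated = integrate_relative_poses([step, step, step])

# A "ground truth" shifted by one unit along y at every frame.
ground_truth = [T.copy() for T in estimated]
for T in ground_truth:
    T[1, 3] += 1.0

rmse = trajectory_rmse(estimated, ground_truth)
```

Because the poses are chained by multiplication, an error in one relative rotation affects every later position, which is why small per-frame rotation errors can produce large trajectory errors.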
Execution time
We report median execution times for the compared methods (see Tab. 1). As described before, the tests were performed in C++ using our own (P3P, Pradeep, SEgo) and the authors’ (Approx) implementations. We measured the times on a 2.3 GHz Core i5 laptop during matching between frames for KITTI sequence 0.
Integration into visual SLAM pipeline
We present the (partial) trajectory reconstructions produced by the ORB-SLAM2 and ORB-SLAM2+SEgo pipelines for the experiments described in the paper, see Fig. 3. A very slow turning motion can still be tracked by the original pipeline, but faster motions result in track loss (see the trajectory for sequence 0). The modified system keeps track even during rapid motions.
1.3 Synthetic experiments
We include here the results of two more synthetic experiments. In the first one we analyze the planar case, in which all the features belong to a common 3D plane. In the second one we study the effect of varying the line length in the model. Recall that BA is a reference method used to show the best achievable accuracy. It is initialized with the true parameter values and performs local optimization by fitting the reprojection cost to the noisy triplet feature projections. PPSEgo and EpiSEgo both use the quadric intersection technique for the ’easy’ cases. PPSEgo uses it ’as is’, while EpiSEgo employs the condition number check. We do so to show the benefits of using the condition number check.
For the planar case experiment, we generate a random 3D plane passing through the center of the model box and project all the geometric features onto it. The other parameters of the experiment remain unchanged. We vary the noise std. dev. from 0 to 1 pix. and obtain the results presented in Fig. 5. The experiment shows that the new solvers work better than the baselines, with two exceptions. First, for the easy cases Pradeep gives a lower median translation error. It uses 4-matches of features, whereas the proposed methods use minimal 3-matches. Even the reference method BA has a higher error in this case, so it is the additional information that helps Pradeep to decrease the error. Second, in the easy cases EpiSEgo sometimes has lower accuracy than the other methods. This shows the importance of the condition number check for the quadric intersection, which is used by PPSEgo but is omitted in EpiSEgo.
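The planar setup can be sketched as follows. How the random plane is sampled is not specified above, so the uniformly random normal, the box size, and the function names are assumptions made for this illustration; line features would be handled by projecting their endpoints in the same way.

```python
import numpy as np

def random_plane_through(center, rng):
    """Plane through `center` with a random unit normal (assumed sampling scheme)."""
    normal = rng.normal(size=3)
    normal /= np.linalg.norm(normal)
    return np.asarray(center, float), normal

def project_onto_plane(points, center, normal):
    """Orthogonally project 3D points onto the plane given by a point and a unit normal."""
    offsets = (points - center) @ normal
    return points - np.outer(offsets, normal)

rng = np.random.default_rng(0)
box_center = np.zeros(3)                          # hypothetical model box centered at the origin
points = rng.uniform(-1.0, 1.0, size=(50, 3))     # hypothetical box of side 2

center, normal = random_plane_through(box_center, rng)
planar_points = project_onto_plane(points, center, normal)

# Signed distances of the projected points to the plane; all should vanish.
distances = (planar_points - center) @ normal
```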
In the next experiment we vary the average line length from 0.1 to 2.0. In this case, we generate the lines in the box with the given expected length. The results are shown in Fig. 4. The accuracy grows with the line length for the cases in which the minimal set contains line features. The new solvers are the most accurate, apart from BA, which has an unfair advantage because it is initialized with the true parameter values.
1.3.1 Code details
The supplemental materials include MATLAB code that reproduces our synthetic experiments. The scripts are described in Tab. 3. Please run mex_setup.m before executing them in order to compile the elimination template construction code. To include the Approx method in the comparison, follow the instructions for C++ compilation in the README file.
This MATLAB code is available at https://github.com/alexandervakhitov/sego-paper-code.git.
References
- [1] Nister, D.: An efficient solution to the five-point relative pose problem. In: TPAMI, IEEE (2004) 756–770
- [2] Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. In: Readings in computer vision. Elsevier (1987) 726–740
- [3] Cadena, C., Carlone, L., Carrillo, H., Latif, Y., Scaramuzza, D., Neira, J., Reid, I., Leonard, J.J.: Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Transactions on Robotics 32 (6) (2016) 1309–1332
- [4] Agarwal, S., Furukawa, Y., Snavely, N., Simon, I., Curless, B., Seitz, S.M., Szeliski, R.: Building Rome in a day. Communications of the ACM 54 (10) (2011) 105–112
- [5] Micusik, B., Wildenauer, H.: Structure from motion with line segments under relaxed endpoint constraints. International Journal of Computer Vision 124 (1) (2017) 65–79
- [6] Xu, C., Zhang, L., Cheng, L., Koch, R.: Pose estimation from line correspondences: A complete analysis and a series of solutions. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6) (2017) 1209–1222
- [7] Stewénius, H., Nistér, D., Oskarsson, M., Åström, K.: Solutions to minimal generalized relative pose problems. In: Workshop on omnidirectional vision. Volume 1. (2005) 3
- [8] Pradeep, V., Lim, J.: Egomotion estimation using assorted features. International Journal of Computer Vision 98 (2) (2012) 202–216
