Monocular Visual Odometry with a Rolling Shutter Camera

Chang-Ryeol Lee; Kuk-Jin Yoon

arXiv:1704.07163·cs.CV·April 25, 2017

TL;DR

This paper introduces a novel monocular visual odometry algorithm tailored for rolling shutter cameras, effectively addressing distortion issues caused by rapid or abrupt camera motions, and validated on synthetic and real datasets.

Contribution

The paper proposes a new RS essential matrix incorporating instantaneous velocities, improving ego-motion estimation accuracy for rolling shutter cameras under dynamic conditions.

Findings

01. Outperforms previous methods in accuracy and robustness

02. Effective in handling abrupt and fast camera motions

03. Validated on synthetic and real datasets

Abstract

Rolling Shutter (RS) cameras have become popular because of their low-cost imaging capability. However, RS cameras suffer from undesirable artifacts when the camera or the subject is moving, or when the illumination changes. For that reason, Monocular Visual Odometry (MVO) with RS cameras produces inaccurate ego-motion estimates. Previous works address this RS distortion problem with motion prediction from images and/or inertial sensors. However, MVO still has trouble handling RS distortion when the camera motion changes abruptly (e.g. vibration of a mobile camera causes extremely fast instantaneous motion). To address this problem, we propose a novel MVO algorithm that takes the geometric characteristics of RS cameras into account. The key idea of the proposed algorithm is a new RS essential matrix that incorporates the instantaneous angular and linear velocities at each…

Tables

Table 1a. Noise-free

Statistic	Metric	Method	Lev 1	Lev 2	Lev 3	Lev 4	Lev 5	Lev 6
Mean	Rotation (deg)	MVO	0.000	0.686	1.266	1.478	1.805	2.093
Mean	Rotation (deg)	MRSVO	0.007	0.036	0.041	0.052	0.373	0.475
Mean	Translation (m)	MVO	0.007	0.119	0.203	0.233	0.296	0.284
Mean	Translation (m)	MRSVO	0.012	0.026	0.038	0.053	0.085	0.078
Standard deviation	Rotation (deg)	MVO	0.000	0.889	1.389	1.116	1.074	1.513
Standard deviation	Rotation (deg)	MRSVO	0.009	0.076	0.137	0.448	0.341	0.368
Standard deviation	Translation (m)	MVO	0.015	0.129	0.141	0.147	0.157	0.160
Standard deviation	Translation (m)	MRSVO	0.015	0.121	0.116	0.125	0.167	0.139
Average	Inlier ratio (%)	MVO	100.0	44.9	34.4	30.2	27.4	25.4
Average	Inlier ratio (%)	MRSVO	100.0	99.9	99.9	99.5	97.9	97.2

Table 1b. Gaussian noise

Statistic	Metric	Method	Lev 1	Lev 2	Lev 3	Lev 4	Lev 5	Lev 6
Mean	Rotation (deg)	MVO	0.349	0.704	1.371	1.844	1.955	2.252
Mean	Rotation (deg)	MRSVO	0.186	0.482	0.657	0.766	0.859	1.045
Mean	Translation (m)	MVO	0.027	0.085	0.167	0.235	0.249	0.282
Mean	Translation (m)	MRSVO	0.017	0.033	0.055	0.097	0.100	0.106
Standard deviation	Rotation (deg)	MVO	0.189	0.346	0.778	0.994	1.111	1.244
Standard deviation	Rotation (deg)	MRSVO	0.134	0.278	0.413	0.499	0.608	0.890
Standard deviation	Translation (m)	MVO	0.028	0.047	0.103	0.131	0.148	0.138
Standard deviation	Translation (m)	MRSVO	0.014	0.021	0.088	0.140	0.148	0.131
Average	Inlier ratio (%)	MVO	23.2	21.5	19.4	18.4	18.0	17.4
Average	Inlier ratio (%)	MRSVO	51.9	50.0	49.2	48.5	47.9	47.7

Table 1c. Laplacian noise

Statistic	Metric	Method	Lev 1	Lev 2	Lev 3	Lev 4	Lev 5	Lev 6
Mean	Rotation (deg)	MVO	0.401	0.805	1.337	1.675	2.017	2.421
Mean	Rotation (deg)	MRSVO	0.541	0.820	0.922	0.930	1.064	1.066
Mean	Translation (m)	MVO	0.032	0.089	0.168	0.204	0.257	0.284
Mean	Translation (m)	MRSVO	0.047	0.073	0.100	0.096	0.116	0.121
Standard deviation	Rotation (deg)	MVO	0.246	0.427	0.813	0.943	1.270	1.370
Standard deviation	Rotation (deg)	MRSVO	0.641	0.741	0.865	0.875	1.015	0.853
Standard deviation	Translation (m)	MVO	0.036	0.057	0.118	0.132	0.150	0.147
Standard deviation	Translation (m)	MRSVO	0.061	0.071	0.122	0.123	0.138	0.148
Average	Inlier ratio (%)	MVO	9.4	8.7	8.0	7.6	7.4	7.0
Average	Inlier ratio (%)	MRSVO	55.2	54.7	53.4	53.2	52.9	50.9

Table 2. Average inlier ratios of the MVO and the MRSVO on the real RS dataset. MVO* indicates the results of the MVO on the real GS dataset.

Sequence	1	2	3	7	9	10	11	12	20	21	All
Frames	445	655	68	44	45	130	50	82	75	60	1714
MVO	61.7%	50.5%	51.1%	54.4%	35.2%	54.4%	17.1%	48.7%	52.5%	46.9%	47.2%
MRSVO	75.5%	71.6%	67.9%	66.5%	49.1%	56.7%	32.6%	67.7%	67.4%	72.4%	62.9%
MVO*	69.6%	58.0%	56.6%	52.5%	44.4%	52.0%	31.2%	47.7%	57.0%	57.0%	52.5%

Equations

$$\mathbf{m}_{gs,i}=\begin{bmatrix}c_{i}\\ r_{i}\end{bmatrix}_{gs}\sim\mathbf{K}\left[\mathbf{R}\mid\mathbf{t}\right]\mathbf{X}_{i}^{W}\tag{1}$$

$$\mathbf{m}_{rs,i}=\begin{bmatrix}c_{i}\\ r_{i}\end{bmatrix}_{rs}\sim\mathbf{K}\left[\mathbf{R}_{rs}(r_{i})\ \ \mathbf{t}_{rs}(r_{i})\right]\begin{bmatrix}\mathbf{R}&\mathbf{t}\\ \mathbf{0}&1\end{bmatrix}\mathbf{X}_{i}^{W}\tag{2}$$

$$\mathbf{R}_{rs}(r_{i})\simeq\mathbf{R}_{rs}(r_{i}\mathbf{w})=\mathbf{I}_{3}+r_{i}\tau\begin{bmatrix}0&-w_{z}&w_{y}\\ w_{z}&0&-w_{x}\\ -w_{y}&w_{x}&0\end{bmatrix}\tag{3}$$

$$\mathbf{t}_{rs}(r_{i})\simeq\mathbf{t}_{rs}(r_{i}\mathbf{v})=r_{i}\tau\begin{bmatrix}v_{x}\\ v_{y}\\ v_{z}\end{bmatrix}\tag{4}$$

$$\mathbf{m}_{rs,i}\sim\mathbf{K}\left[\mathbf{R}_{rs}(r_{i}\mathbf{w})\ \ \mathbf{t}_{rs}(r_{i}\mathbf{v})\right]\begin{bmatrix}\mathbf{H}^{-1}&\mathbf{0}\\ \mathbf{0}&1\end{bmatrix}\begin{bmatrix}\mathbf{K}^{-1}&\mathbf{0}\\ \mathbf{0}&1\end{bmatrix}\begin{bmatrix}\tilde{\mathbf{m}}_{gs,i}\\ 1\end{bmatrix}\tag{5}$$

$$\left(\tilde{\mathbf{m}}_{gs}^{C_{t-1}}\right)^{T}\mathbf{F}_{gs}\,\tilde{\mathbf{m}}_{gs}^{C_{t}}=0,\quad\mathbf{F}_{gs}=\mathbf{K}^{-T}\mathbf{E}_{gs}\mathbf{K}^{-1},\quad\mathbf{E}_{gs}=\lfloor\mathbf{t}_{gs}\rfloor_{\times}\mathbf{R}_{gs}\tag{6}$$

$$\left(\tilde{\mathbf{m}}_{rs}^{C_{t-1}}\right)^{T}\mathbf{F}_{rs}\,\tilde{\mathbf{m}}_{rs}^{C_{t}}=0,\quad\mathbf{F}_{rs}=\mathbf{K}^{-T}\mathbf{E}_{rs}\mathbf{K}^{-1},\quad\left(\tilde{\mathbf{x}}_{rs}^{C_{t-1}}\right)^{T}\mathbf{E}_{rs}\,\tilde{\mathbf{x}}_{rs}^{C_{t}}=0\tag{7}$$

$$\left(\mathbf{x}_{rs}^{C_{t-1}}\right)^{T}\mathbf{E}_{rs}\,\mathbf{x}_{rs}^{C_{t}}=0\tag{8}$$

$$\mathbf{x}_{rs}^{C_{t-1}}=\left[\mathbf{R}_{rs}^{C_{t-1}}\ \ \mathbf{t}_{rs}^{C_{t-1}}\right]^{-1}\begin{bmatrix}\mathbf{R}_{gs}&\mathbf{t}_{gs}\\ \mathbf{0}&1\end{bmatrix}^{-1}\mathbf{X}_{gs}^{C_{t}}\quad\text{s.t.}\quad\mathbf{X}_{gs}^{C_{t}}=\begin{bmatrix}\mathbf{R}_{gs}&\mathbf{t}_{gs}\\ \mathbf{0}&1\end{bmatrix}\mathbf{X}_{gs}^{C_{t-1}}\tag{9}$$

$$\mathbf{x}_{rs}^{C_{t}}=\left[\mathbf{R}_{rs}^{C_{t}}\ \ \mathbf{t}_{rs}^{C_{t}}\right]^{-1}\mathbf{X}_{gs}^{C_{t}}\tag{10}$$

$$\left(\left[\mathbf{R}_{rs}^{C_{t-1}}\ \ \mathbf{t}_{rs}^{C_{t-1}}\right]^{-1}\begin{bmatrix}\mathbf{R}_{gs}&\mathbf{t}_{gs}\\ \mathbf{0}&1\end{bmatrix}^{-1}\mathbf{X}_{gs}^{C_{t}}\right)^{T}\mathbf{E}_{rs}\left(\left[\mathbf{R}_{rs}^{C_{t}}\ \ \mathbf{t}_{rs}^{C_{t}}\right]^{-1}\mathbf{X}_{gs}^{C_{t}}\right)=0\tag{11}$$

$$\mathbf{E}_{rs}=\left(\mathbf{R}_{rs}^{C_{t-1}}\right)^{T}\mathbf{R}_{gs}\left\lfloor\mathbf{t}_{gs}-\mathbf{t}_{rs}^{C_{t}}+\mathbf{R}_{gs}\mathbf{t}_{rs}^{C_{t-1}}\right\rfloor_{\times}\mathbf{R}_{rs}^{C_{t}}\tag{12}$$

$$\mathbf{x}=\left[\mathbf{q}_{gs},\ \mathbf{t}_{gs},\ \mathbf{w}^{C_{t-1}},\ \mathbf{v}^{C_{t-1}},\ \mathbf{w}^{C_{t}},\ \mathbf{v}^{C_{t}}\right]^{T}\tag{13}$$

$$\tilde{\mathbf{x}}=\left[\mathbf{\delta\theta}_{gs}\ \ \delta\mathbf{t}_{gs}\ \ \delta\mathbf{w}^{C_{t-1}}\ \ \delta\mathbf{v}^{C_{t-1}}\ \ \delta\mathbf{w}^{C_{t}}\ \ \delta\mathbf{v}^{C_{t}}\right]\tag{14}$$

$$\min_{\tilde{\mathbf{x}}}\sum_{i}^{17}\frac{\left\|\left(\mathbf{m}_{rs,i}^{C_{t}}\right)^{T}\mathbf{F}_{rs}(\tilde{\mathbf{x}})\,\mathbf{m}_{rs,i}^{C_{t-1}}\right\|_{2}^{2}}{\left(\mathbf{F}_{rs}(\tilde{\mathbf{x}})\mathbf{m}_{rs,i}^{C_{t-1}}\right)_{c}^{2}+\left(\mathbf{F}_{rs}(\tilde{\mathbf{x}})\mathbf{m}_{rs,i}^{C_{t-1}}\right)_{r}^{2}+\left(\mathbf{F}_{rs}(\tilde{\mathbf{x}})\mathbf{m}_{rs,i}^{C_{t}}\right)_{c}^{2}+\left(\mathbf{F}_{rs}(\tilde{\mathbf{x}})\mathbf{m}_{rs,i}^{C_{t}}\right)_{r}^{2}},\quad\text{s.t.}\quad\mathbf{F}_{rs}(\tilde{\mathbf{x}})=\mathbf{K}^{-T}\mathbf{E}_{rs}(\tilde{\mathbf{x}})\mathbf{K}^{-1}\tag{15}$$

$$\delta\mathbf{T}=\begin{bmatrix}\hat{\delta\mathbf{R}}&\delta\mathbf{t}\\ \mathbf{0}_{1\times 3}&1\end{bmatrix}=\mathbf{T}_{gt}\mathbf{T}_{est}^{-1},\quad\delta\theta=f(\hat{\delta\mathbf{R}}),\quad\delta t=\left\|\delta\mathbf{t}\right\|_{2}\tag{16}$$

$$\mathbf{T}_{k+1}=\mathbf{T}_{k}\begin{bmatrix}\hat{\mathbf{R}}_{gs}&\hat{\mathbf{t}}_{gs}\\ \mathbf{0}_{1\times 3}&1\end{bmatrix}^{-1},\quad\mathbf{T}_{0}=\mathbf{I}_{4\times 4}\tag{17}$$

Taxonomy

Topics: Robotics and Sensor-Based Localization · Advanced Vision and Imaging · Optical measurement and interference techniques

Full text

Chang-Ryeol Lee and Kuk-Jin Yoon

School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology

Email: {crlee,kjyoon}@gist.ac.kr

Abstract

Rolling Shutter (RS) cameras have become popular because of their low-cost imaging capability. However, RS cameras suffer from undesirable artifacts when the camera or the subject is moving, or when the illumination changes. For that reason, Monocular Visual Odometry (MVO) with RS cameras produces inaccurate ego-motion estimates. Previous works address this RS distortion problem with motion prediction from images and/or inertial sensors. However, MVO still has trouble handling RS distortion when the camera motion changes abruptly (e.g. vibration of a mobile camera causes extremely fast instantaneous motion). To address this problem, we propose a novel MVO algorithm that takes the geometric characteristics of RS cameras into account. The key idea of the proposed algorithm is a new RS essential matrix that incorporates the instantaneous angular and linear velocities at each frame. Our algorithm produces accurate and robust ego-motion estimates in an online manner, and is applicable to various mobile applications with RS cameras. The superiority of the proposed algorithm is validated through quantitative and qualitative comparisons on both synthetic and real datasets.

Keywords:

Monocular Visual Odometry, Rolling Shutter Cameras, Ego-motion Estimation

1 Introduction

Odometry, which estimates 6-DOF ego-motion, is a crucial technology for mobile and robotics applications. Visual Odometry (VO) using cameras has been studied extensively for robot navigation [1] and autonomous driving [2] for decades. In practice, VO has distinct advantages in GPS-denied environments such as urban, military, underwater, and indoor areas, and it drifts less than Wheel Odometry (WO) and Inertial Odometry (IO). In particular, Monocular Visual Odometry (MVO) has been actively studied for a decade because of its compactness and low cost [3].

For MVO, Rolling Shutter (RS) cameras, which capture the image line by line, are often preferred over Global Shutter (GS) cameras, which capture all image lines at once, because of their low-cost imaging capability. However, MVO with an RS camera becomes challenging when the camera (or the subject) is moving or the illumination changes, because the lines of an RS image are captured from different poses. The pose change during RS image capture causes unmodeled noise and outliers in feature-point-based ego-motion estimation. To correct the RS artifacts, researchers exploit motion predicted from temporally neighboring frames under a smooth-camera-motion assumption [4] [5] and/or from inertial sensors [6] [7]. However, it is difficult to predict the RS artifacts when the camera motion changes abruptly, as with hand-held and/or vehicle-mounted cameras (e.g. vibration of a car on uneven ground), because the camera motion becomes extremely fast for an instant. Such unpredictable RS artifacts dramatically reduce the number of inliers in ego-motion estimation and lead to inaccurate and inconsistent estimates. Figure 1 shows a distorted RS image and the effect of RS artifacts on ego-motion estimation (left).

In this paper, we propose a novel MVO algorithm that takes the geometric characteristics of RS cameras into account. The proposed algorithm jointly estimates the relative camera transformation between two frames and the instantaneous camera motion at each frame, which consists of the linear and angular velocities. The key idea is a new RS essential matrix that incorporates the instantaneous angular and linear velocities at each frame. The RS essential matrix is highly nonlinear and has 17 DOF (relative rotation/translation and the RS linear and angular velocities), so we adopt the Levenberg-Marquardt (LM) algorithm to estimate the relative camera transformation and the instantaneous camera motion from it. By using the RS essential matrix, the proposed algorithm yields a larger number of inliers than conventional MVO, as shown in Fig. 1. Consequently, our algorithm produces accurate and robust ego-motion estimates in an online manner and is applicable to various mobile applications. The proposed approach handles RS artifacts caused by smooth and/or abrupt motion. In addition, our work can provide an initial solution for time-delayed/off-line MVO algorithms. The contributions of this paper can be summarized as follows.

$\bullet$ Introduction of the instantaneous camera motion model for the RS camera problem,

$\bullet$ Formulation of the RS essential matrix,

$\bullet$ Proposition of the joint estimation algorithm of relative pose and instantaneous motion with only two images.

This paper is organized as follows. We review related work in Sec. 2 and define the terminology used in the formulation in Sec. 3. The RS camera geometry is explained in Sec. 4. We describe the RS essential matrix, which incorporates the RS camera geometry, in Sec. 5, and explain how to estimate the relative rotation and translation from it in Sec. 6. We show experimental results in Sec. 7. Finally, we discuss limitations and future work and conclude the paper in Sec. 8.

2 Related Works

The RS effect has been addressed in several computer vision problems, such as the Perspective-n-Point (PnP) problem, which requires 3D point clouds generated by a GS camera. Ait-Aider *et al.* estimated the pose and velocity of fast-moving objects from a single image taken with a rolling shutter camera [8]. Nonlinear and linear models were proposed for general and planar objects, respectively. Magerand *et al.* extended this work with a polynomial rolling shutter model and constrained global optimization [9] [10]. Albl *et al.* proposed a double-linearized rolling shutter model for efficient estimation [11]. This work remarkably increased the accuracy of motion estimation and the number of inliers.

On the other hand, inertial sensors are a powerful option when estimating ego-motion with an RS camera. Karpenko *et al.* exploited gyroscopes to correct RS effects and stabilize video on a smartphone [6]. The gyroscope and the camera were synchronized by comparing angular velocity measurements from the gyroscope with feature displacements computed from the video. Jia and Evans estimated camera orientation in a Bayesian estimation framework with inertial measurements and also corrected the RS effects in images [7]. Guo *et al.* proposed an ego-motion estimation framework that fuses an inertial sensor and a camera while accounting for artifacts that occur in mobile devices, such as RS effects and the synchronization between inertial measurements and images [12].

Handling RS distortion for ego-motion estimation with only a monocular RS camera (without inertial sensors) has also been studied. Klein and Murray corrected the RS effect by using the camera velocity estimated in the ego-motion estimation framework [4]. Since this work focused on the efficiency of the algorithm, a simple strategy was selected. Hedborg *et al.* proposed using temporal information from an image sequence [13] [5]. Their RS bundle adjustment is a powerful scheme, but it is a time-consuming, off-line algorithm. Saurer *et al.* handled RS distortion in the 3D reconstruction problem, because the RS effects and the 3D structure of a scene are highly related to each other [14].

Our work is also based only on a monocular RS camera and estimates ego-motion while accounting for the RS effects in an online manner (i.e. two-frame relative motion estimation). Consequently, unlike previous works, the proposed algorithm requires neither 3D point clouds, nor additional inertial sensors, nor temporal image information (i.e. video).

3 Terminology

Before presenting the RS camera geometry and the proposed algorithm, we define several terms for the two-view geometry of an RS camera, illustrated in Fig. 2. An RS image is an image containing RS distortion (a GS image $+$ the RS distortion), and the subscripts $gs$ and $rs$ indicate a global shutter camera and a rolling shutter camera, respectively. The transformation between two RS images is composed of one GS transformation and two RS transformations, defined as follows.

1) GS transformation: rotation $\mathbf{R}_{gs}\in SO(3)$ and translation $\mathbf{t}_{gs}\in\mathbb{R}^{3}$ between the two undistorted GS images, from frame $t$ to frame $t-1$.

2) RS transformation: row-wise rotation $\mathbf{R}_{rs}(r_{i})\in SO(3)$ and translation $\mathbf{t}_{rs}(r_{i})\in\mathbb{R}^{3}$ from the distorted RS image to the undistorted GS image at frame $t$ (or $t-1$) (see Fig. 2), where $r_{i}$ denotes the row of a feature point and the subscript $i$ is the index of the feature point. The time range of the RS transformation in an image is from $0$ to $\left(n_{row}-1\right)\cdot\tau$, where $n_{row}$ is the number of rows and $\tau$ is the exposure time for one line.

3) Instantaneous RS motion: angular and linear velocities $\mathbf{w}\in\mathbb{R}^{3}$ , $\mathbf{v}\in\mathbb{R}^{3}$ around $r_{i}=0$ .

4 Rolling Shutter Camera Geometry

In general GS camera geometry, a 3D point $\mathbf{X}_{i}^{W}\in\mathbb{P}^{3}$ in homogeneous world coordinates is projected to a 2D point $\mathbf{m}_{i}\in\mathbb{R}^{2}$ in image coordinates with the camera intrinsic parameters (focal length, skew, and position of the principal point) and extrinsic parameters (rotation and translation). Here, the superscript $W$ indicates the world coordinate frame. This projection can be described as

$$\mathbf{m}_{gs,i}=\begin{bmatrix}c_{i}\\ r_{i}\end{bmatrix}_{gs}\sim\mathbf{K}\left[\mathbf{R}\mid\mathbf{t}\right]\mathbf{X}_{i}^{W},\tag{1}$$

where $\mathbf{K}\in\mathbb{R}^{3\times 3}$ is the intrinsic matrix, $\mathbf{R}\in SO(3)$ is a rotation matrix, and $\mathbf{t}\in\mathbb{R}^{3}$ is a translation vector with respect to the world coordinate frame. The RS camera geometry is derived from this well-known GS camera geometry.

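In an RS camera, a 3D point $\mathbf{X}_{i}^{W}\in\mathbb{P}^{3}$ is projected to the 2D point $\mathbf{m}_{rs,i}\in\mathbb{R}^{2}$ under line-by-line exposure, that is, through the RS transformation, as

$$\mathbf{m}_{rs,i}=\begin{bmatrix}c_{i}\\ r_{i}\end{bmatrix}_{rs}\sim\mathbf{K}\left[\mathbf{R}_{rs}(r_{i})\ \ \mathbf{t}_{rs}(r_{i})\right]\begin{bmatrix}\mathbf{R}&\mathbf{t}\\ \mathbf{0}&1\end{bmatrix}\mathbf{X}_{i}^{W}.\tag{2}$$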

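The RS transformation $[\mathbf{R}_{rs}(r_{i})$ , $\mathbf{t}_{rs}(r_{i})]$ is approximated by linear functions of the image row $r_{i}$ and the instantaneous RS motion $\mathbf{w},\mathbf{v}$ as

$$\mathbf{R}_{rs}(r_{i})\simeq\mathbf{R}_{rs}(r_{i}\mathbf{w})=\mathbf{I}_{3}+r_{i}\tau\begin{bmatrix}0&-w_{z}&w_{y}\\ w_{z}&0&-w_{x}\\ -w_{y}&w_{x}&0\end{bmatrix},\tag{3}$$

$$\mathbf{t}_{rs}(r_{i})\simeq\mathbf{t}_{rs}(r_{i}\mathbf{v})=r_{i}\tau\begin{bmatrix}v_{x}\\ v_{y}\\ v_{z}\end{bmatrix},\tag{4}$$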

where $\tau$ is an exposure time for one line of an RS camera.
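The linearized RS transformation can be sketched in a few lines of Python. This is a minimal illustration, assuming angular velocity in rad/s, linear velocity in m/s, and $\tau$ in seconds per row; the function names `skew` and `rs_transform` and the example values are hypothetical and not taken from the paper.

```python
import math

def skew(w):
    """Return the 3x3 skew-symmetric matrix [w]_x as nested lists."""
    wx, wy, wz = w
    return [[0.0, -wz, wy],
            [wz, 0.0, -wx],
            [-wy, wx, 0.0]]

def rs_transform(row, tau, w, v):
    """First-order RS transformation for image row `row`.

    R_rs = I_3 + row * tau * [w]_x
    t_rs = row * tau * v
    """
    s = row * tau  # time elapsed since row 0 was exposed
    S = skew(w)
    R = [[(1.0 if i == j else 0.0) + s * S[i][j] for j in range(3)]
         for i in range(3)]
    t = [s * vk for vk in v]
    return R, t

# Assumed example: 50 us per row, 100 deg/s about the y axis, 1 m/s along z.
w = (0.0, math.radians(100.0), 0.0)
v = (0.0, 0.0, 1.0)
R0, t0 = rs_transform(0, 50e-6, w, v)        # first row: no distortion
R719, t719 = rs_transform(719, 50e-6, w, v)  # last row of a 720-row image
```

Row 0 yields the identity transformation, so the distortion grows linearly from the first to the last row of the image.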

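The relation between the GS image and the RS image can be expressed as

$$\mathbf{m}_{rs,i}\sim\mathbf{K}\left[\mathbf{R}_{rs}(r_{i}\mathbf{w})\ \ \mathbf{t}_{rs}(r_{i}\mathbf{v})\right]\begin{bmatrix}\mathbf{H}^{-1}&\mathbf{0}\\ \mathbf{0}&1\end{bmatrix}\begin{bmatrix}\mathbf{K}^{-1}&\mathbf{0}\\ \mathbf{0}&1\end{bmatrix}\begin{bmatrix}\tilde{\mathbf{m}}_{gs,i}\\ 1\end{bmatrix},\tag{5}$$

where $\mathbf{H}\in\mathbb{R}^{3\times 3}$ is a back-projection matrix and $\tilde{\mathbf{m}}_{i}\in\mathbb{R}^{3}$ is the position of a feature point in the normalized image coordinates (i.e. $\tilde{\mathbf{m}}_{i}=\left[\ c_{i},\ r_{i},\ 1\ \right]^{T}$ ). The green arrows in Fig. 2 indicate the transformation between feature correspondences in the GS and the RS images.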

5 Rolling Shutter Essential Matrix

To estimate ego-motion in an online manner, we focus on the relative transformation between two consecutive frames. Efficient MVO can be achieved up to scale by concatenating the relative transformation estimates [15]. Estimating the relative transformation with a GS camera is well formulated as the fundamental/essential matrix estimation problem [16]:

$$\left(\tilde{\mathbf{m}}_{gs}^{C_{t-1}}\right)^{T}\mathbf{F}_{gs}\,\tilde{\mathbf{m}}_{gs}^{C_{t}}=0,\quad\mathbf{F}_{gs}=\mathbf{K}^{-T}\mathbf{E}_{gs}\mathbf{K}^{-1},\quad\mathbf{E}_{gs}=\lfloor\mathbf{t}_{gs}\rfloor_{\times}\mathbf{R}_{gs},\tag{6}$$

where the superscript $C$ indicates the camera coordinate frame, the subscripts $t-1$ and $t$ are the time indices of two consecutive frames, and $\mathbf{F}_{gs}\in\mathbb{R}^{3\times 3}$ is the GS fundamental matrix.

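The 2D feature points $\tilde{\mathbf{m}}_{rs}^{C_{t-1}}$ , $\tilde{\mathbf{m}}_{rs}^{C_{t}}$ in two consecutive RS images and the RS fundamental matrix $\mathbf{F}_{rs}$ satisfy the following constraint:

$$\left(\tilde{\mathbf{m}}_{rs}^{C_{t-1}}\right)^{T}\mathbf{F}_{rs}\,\tilde{\mathbf{m}}_{rs}^{C_{t}}=0,\quad\mathbf{F}_{rs}=\mathbf{K}^{-T}\mathbf{E}_{rs}\mathbf{K}^{-1},\quad\left(\tilde{\mathbf{x}}_{rs}^{C_{t-1}}\right)^{T}\mathbf{E}_{rs}\,\tilde{\mathbf{x}}_{rs}^{C_{t}}=0,\tag{7}$$

where $\tilde{\mathbf{x}}\in\mathbb{P}^{2}$ is a feature point in homogeneous camera coordinates (i.e. $\tilde{\mathbf{x}}=\left[\ x,\ y,\ 1\ \right]^{T}=\mathbf{K}^{-1}\tilde{\mathbf{m}}$ ).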

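Now, we derive the essential matrix $\mathbf{E}_{rs}$ that incorporates the instantaneous RS motion at each frame, that is, the RS essential matrix. Given the scale of the feature points, the RS essential matrix satisfies the following constraint:

$$\left(\mathbf{x}_{rs}^{C_{t-1}}\right)^{T}\mathbf{E}_{rs}\,\mathbf{x}_{rs}^{C_{t}}=0,\tag{8}$$

where $\mathbf{x}^{C_{t}}=\mathbf{H}\tilde{\mathbf{x}}^{C_{t}}\in\mathbb{R}^{3}$ is a feature point in camera coordinates. The feature points $\mathbf{x}_{rs}^{C_{t-1}}$ and $\mathbf{x}_{rs}^{C_{t}}$ in the RS camera coordinates are expressed with respect to the feature points $\mathbf{x}_{gs}^{C_{t}}$ in the GS camera coordinates. For brevity, we denote $\mathbf{R}_{rs}(r_{i}\mathbf{w}^{C_{t}})$ and $\mathbf{t}_{rs}(r_{i}\mathbf{v}^{C_{t}})$ as $\mathbf{R}_{rs}^{C_{t}}$ and $\mathbf{t}_{rs}^{C_{t}}$ . Then,

$$\mathbf{x}_{rs}^{C_{t-1}}=\left[\mathbf{R}_{rs}^{C_{t-1}}\ \ \mathbf{t}_{rs}^{C_{t-1}}\right]^{-1}\begin{bmatrix}\mathbf{R}_{gs}&\mathbf{t}_{gs}\\ \mathbf{0}&1\end{bmatrix}^{-1}\mathbf{X}_{gs}^{C_{t}}\quad\text{s.t.}\quad\mathbf{X}_{gs}^{C_{t}}=\begin{bmatrix}\mathbf{R}_{gs}&\mathbf{t}_{gs}\\ \mathbf{0}&1\end{bmatrix}\mathbf{X}_{gs}^{C_{t-1}},\tag{9}$$

$$\mathbf{x}_{rs}^{C_{t}}=\left[\mathbf{R}_{rs}^{C_{t}}\ \ \mathbf{t}_{rs}^{C_{t}}\right]^{-1}\mathbf{X}_{gs}^{C_{t}},\tag{10}$$

where $\mathbf{X}^{C}\in\mathbb{P}^{3}$ is the feature point in homogeneous camera coordinates.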

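The 3D feature points in two consecutive RS frames are converted to 3D points in the GS camera with Eq. (9) and Eq. (10). Thus, the constraint Eq. (8) is converted to

$$\left(\left[\mathbf{R}_{rs}^{C_{t-1}}\ \ \mathbf{t}_{rs}^{C_{t-1}}\right]^{-1}\begin{bmatrix}\mathbf{R}_{gs}&\mathbf{t}_{gs}\\ \mathbf{0}&1\end{bmatrix}^{-1}\mathbf{X}_{gs}^{C_{t}}\right)^{T}\mathbf{E}_{rs}\left(\left[\mathbf{R}_{rs}^{C_{t}}\ \ \mathbf{t}_{rs}^{C_{t}}\right]^{-1}\mathbf{X}_{gs}^{C_{t}}\right)=0.\tag{11}$$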

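Finally, we obtain the RS essential matrix, which has 17 DOF: one GS transformation (rotation/translation) and two RS transformations (angular/linear velocities), up to scale $(3+3-1+(3+3)\times 2=17)$ :

$$\mathbf{E}_{rs}=\left(\mathbf{R}_{rs}^{C_{t-1}}\right)^{T}\mathbf{R}_{gs}\left\lfloor\mathbf{t}_{gs}-\mathbf{t}_{rs}^{C_{t}}+\mathbf{R}_{gs}\mathbf{t}_{rs}^{C_{t-1}}\right\rfloor_{\times}\mathbf{R}_{rs}^{C_{t}}.\tag{12}$$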
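To make the structure of the RS essential matrix concrete, the sketch below transcribes the expression $\mathbf{E}_{rs}=(\mathbf{R}_{rs}^{C_{t-1}})^{T}\mathbf{R}_{gs}\lfloor\mathbf{t}_{gs}-\mathbf{t}_{rs}^{C_{t}}+\mathbf{R}_{gs}\mathbf{t}_{rs}^{C_{t-1}}\rfloor_{\times}\mathbf{R}_{rs}^{C_{t}}$ literally into plain Python. It is an illustration only: the helper names and all numeric values are assumptions, and because the RS transformations depend on the feature row, the matrix has to be rebuilt for each correspondence.

```python
import math

def skew(a):
    """Skew-symmetric matrix [a]_x such that [a]_x b = a x b."""
    return [[0.0, -a[2], a[1]],
            [a[2], 0.0, -a[0]],
            [-a[1], a[0], 0.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]

def matvec(A, x):
    return [sum(A[i][k] * x[k] for k in range(3)) for i in range(3)]

def det3(M):
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def rs_rotation(row, tau, w):
    """Linearized row-wise rotation I_3 + row * tau * [w]_x."""
    S = skew(w)
    return [[(1.0 if i == j else 0.0) + row * tau * S[i][j]
             for j in range(3)] for i in range(3)]

def rs_essential(R_gs, t_gs, R_prev, t_prev, R_cur, t_cur):
    """E_rs = R_prev^T R_gs [t_gs - t_cur + R_gs t_prev]_x R_cur."""
    Rt = matvec(R_gs, t_prev)
    b = [t_gs[k] - t_cur[k] + Rt[k] for k in range(3)]
    return matmul(matmul(matmul(transpose(R_prev), R_gs), skew(b)), R_cur)

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# With no RS motion and no GS rotation, only the skew matrix of t_gs remains.
E_static = rs_essential(I3, [1.0, 0.0, 0.0], I3, [0.0] * 3, I3, [0.0] * 3)

# Assumed example: a feature seen at row 300 in frame t-1 and row 410 in frame t.
tau = 50e-6
c, s = math.cos(0.1), math.sin(0.1)
R_gs = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
R_prev = rs_rotation(300, tau, [0.2, -0.1, 0.3])
t_prev = [300 * tau * vk for vk in (1.0, 0.0, 2.0)]
R_cur = rs_rotation(410, tau, [0.1, 0.4, -0.2])
t_cur = [410 * tau * vk for vk in (0.5, 0.0, 2.5)]
E = rs_essential(R_gs, [1.0, 0.2, 0.0], R_prev, t_prev, R_cur, t_cur)
```

Because the expression contains a skew-symmetric factor, the resulting matrix is singular for any choice of parameters, as with a conventional essential matrix.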

6 Estimation of Rolling Shutter Camera Motion

The conventional essential matrix estimation is performed by Direct Linear Transformation (DLT), and rotation $\mathbf{R}_{gs}$ and translation $\mathbf{t}_{gs}$ are extracted by Singular Value Decomposition (SVD) [16]. As described in Sec. 1, the conventional motion estimation becomes inaccurate with RS images. Thus, we jointly estimate the GS and RS camera motion. The overall process of the proposed Monocular Rolling Shutter Visual Odometry (MRSVO) is described in Algorithm 1. The constraint Eq. (8) between RS feature points is highly nonlinear. For that reason, we estimate the solution of this nonlinear equation with the Levenberg-Marquardt algorithm.

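From the RS essential matrix, the nominal state variables to estimate, $\mathbf{x}\in\mathbb{R}^{19}$ , are defined as

$$\mathbf{x}=\left[\mathbf{q}_{gs},\ \mathbf{t}_{gs},\ \mathbf{w}^{C_{t-1}},\ \mathbf{v}^{C_{t-1}},\ \mathbf{w}^{C_{t}},\ \mathbf{v}^{C_{t}}\right]^{T},\tag{13}$$

where $\mathbf{q}_{gs}\in\mathbb{R}^{4}$ denotes a quaternion that expresses the 3-DOF rotation $\mathbf{R}_{gs}$ .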

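The error state $\tilde{\mathbf{x}}\in\mathbb{R}^{18\times 1}$ , inspired by [17], is introduced to reduce the complexity:

$$\tilde{\mathbf{x}}=\left[\mathbf{\delta\theta}_{gs}\ \ \delta\mathbf{t}_{gs}\ \ \delta\mathbf{w}^{C_{t-1}}\ \ \delta\mathbf{v}^{C_{t-1}}\ \ \delta\mathbf{w}^{C_{t}}\ \ \delta\mathbf{v}^{C_{t}}\right],\tag{14}$$

where $\mathbf{\delta\theta}_{gs}\in\mathbb{R}^{3\times 1}$ is the angular error of the GS rotation, which originates from the error quaternion $\delta\mathbf{q}\simeq\left[\ 1\ \ \frac{1}{2}\delta\theta^{T}\ \right]^{T}$ .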
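The small-angle error quaternion can be illustrated with a short update routine. This is a sketch under stated assumptions: quaternions are stored scalar-first, the Hamilton product is used, and the error is applied by left multiplication. The convention used in [17] and in the authors' implementation may differ, and `quat_mul` and `apply_error_state` are hypothetical names.

```python
import math

def quat_mul(a, b):
    """Hamilton product of two scalar-first quaternions [w, x, y, z]."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return [aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw]

def apply_error_state(q, dtheta):
    """Fold a 3-vector angular error into the nominal quaternion.

    dq ~ [1, dtheta / 2] is the first-order error quaternion; the result
    is renormalized because dq is only approximately unit length.
    """
    dq = [1.0, 0.5 * dtheta[0], 0.5 * dtheta[1], 0.5 * dtheta[2]]
    p = quat_mul(dq, q)
    n = math.sqrt(sum(c * c for c in p))
    return [c / n for c in p]

q_same = apply_error_state([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
q_new = apply_error_state([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.02])
```

Parameterizing the rotation error with three angles instead of four quaternion components is what reduces the state from 19 to 18 dimensions.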

The initial nominal states $\mathbf{q}_{gs},\ \mathbf{t}_{gs}$ are obtained from the conventional DLT and SVD algorithm. We set the initial state to zero to handle arbitrary RS distortions.

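The cost function is constructed as the Sampson error:

$$\min_{\tilde{\mathbf{x}}}\sum_{i}^{17}\frac{\left\|\left(\mathbf{m}_{rs,i}^{C_{t}}\right)^{T}\mathbf{F}_{rs}(\tilde{\mathbf{x}})\,\mathbf{m}_{rs,i}^{C_{t-1}}\right\|_{2}^{2}}{\left(\mathbf{F}_{rs}(\tilde{\mathbf{x}})\mathbf{m}_{rs,i}^{C_{t-1}}\right)_{c}^{2}+\left(\mathbf{F}_{rs}(\tilde{\mathbf{x}})\mathbf{m}_{rs,i}^{C_{t-1}}\right)_{r}^{2}+\left(\mathbf{F}_{rs}(\tilde{\mathbf{x}})\mathbf{m}_{rs,i}^{C_{t}}\right)_{c}^{2}+\left(\mathbf{F}_{rs}(\tilde{\mathbf{x}})\mathbf{m}_{rs,i}^{C_{t}}\right)_{r}^{2}},\quad\text{s.t.}\quad\mathbf{F}_{rs}(\tilde{\mathbf{x}})=\mathbf{K}^{-T}\mathbf{E}_{rs}(\tilde{\mathbf{x}})\mathbf{K}^{-1},\tag{15}$$

where the subscripts $c$ and $r$ indicate the column and the row of a feature point in the normalized camera coordinates, respectively.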
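For a single correspondence, the Sampson error can be sketched as follows. The block implements the textbook form from [16], in which the second pair of denominator terms uses the transpose of the fundamental matrix; this is an assumption about the intended cost and may differ in index convention from the expression as printed. The function name and the test matrix are hypothetical.

```python
def sampson_error(F, x_prev, x_cur):
    """First-order geometric error of the constraint x_prev^T F x_cur = 0.

    F is a 3x3 nested list; x_prev and x_cur are homogeneous points
    [column, row, 1].
    """
    # Epipolar line of x_cur in the previous image, and vice versa.
    Fx = [sum(F[i][j] * x_cur[j] for j in range(3)) for i in range(3)]
    Ftx = [sum(F[j][i] * x_prev[j] for j in range(3)) for i in range(3)]
    num = sum(x_prev[i] * Fx[i] for i in range(3)) ** 2
    den = Fx[0] ** 2 + Fx[1] ** 2 + Ftx[0] ** 2 + Ftx[1] ** 2
    return num / den

# Toy fundamental matrix: pure translation along the column axis, F = [t]_x.
F_toy = [[0.0, 0.0, 0.0],
         [0.0, 0.0, -1.0],
         [0.0, 1.0, 0.0]]

e_on = sampson_error(F_toy, [0.5, 0.3, 1.0], [0.2, 0.3, 1.0])   # same row
e_off = sampson_error(F_toy, [0.5, 0.4, 1.0], [0.2, 0.3, 1.0])  # row differs by 0.1
```

For this toy matrix, matching points must lie on the same row, so the first pair has zero error and the second is penalized in proportion to the squared row offset.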

Nonlinear least-squares methods such as the LM algorithm easily converge to a local minimum. To avoid bad initialization as well as outliers, we perform a RANSAC process. We give 20 feature point correspondences as the input of the RANSAC for robustness.
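The RANSAC step can be sketched generically. The block below is a toy illustration, not the authors' implementation: `ransac`, `fit_line`, and `line_residual` are hypothetical names, and a 2D line fit stands in for the LM-based RS motion estimator. Only the sample size of 20 is taken from the text; the threshold and iteration count are assumed.

```python
import random

def ransac(data, fit, residual, sample_size, threshold, iterations, seed=0):
    """Keep the model fitted on a random sample that explains the most data."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        sample = rng.sample(data, sample_size)
        model = fit(sample)
        inliers = [d for d in data if abs(residual(model, d)) < threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers

def fit_line(points):
    """Least-squares fit of y = a * x + b (stand-in for the LM estimator)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx
    return a, my - a * mx

def line_residual(model, point):
    a, b = model
    return point[1] - (a * point[0] + b)

# 30 exact points on y = 2x + 1 plus three gross outliers.
points = [(float(x), 2.0 * x + 1.0) for x in range(30)]
points += [(5.0, 40.0), (12.0, -7.0), (20.0, 90.0)]

model, inliers = ransac(points, fit_line, line_residual,
                        sample_size=20, threshold=0.5, iterations=500)
```

With 3 outliers among 33 points, only about 5% of the 20-point samples are outlier-free, which is why this sketch runs 500 iterations; a large sample size trades more iterations for a better-conditioned fit.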

7 Experimental Results

To validate the proposed RS camera motion estimation algorithm, we perform several experiments on a synthetic dataset as well as a real dataset. Since, to the best of our knowledge, no previous work specifically addresses RS distortion for online ego-motion estimation, we compare the proposed algorithm with conventional MVO based on the fundamental matrix estimated with the normalized 8-point algorithm [16]. The quantitative and qualitative comparisons on both datasets show the superiority of the proposed algorithm. On the synthetic dataset, we analyze the influence of feature tracking error due to RS effects on relative motion estimates from two consecutive RS images. The analysis reveals empirically that the inlier ratio and the accuracy of the relative motion estimates are highly correlated. On the real dataset, we verify the superiority of our algorithm on several sequences taken with a hand-held RS camera.

7.1 Synthetic data

We generate synthetic data with a smooth trajectory and 3D points, as shown in Fig. 3. The trajectory is manually specified and interpolated with a spline function, and random noise is then added to simulate realistic camera motion. The traveling distance is about 80 m. All visible 3D points are projected onto the image. The 250 images are captured at 5 Hz. The image resolution is 1280 $\times$ 720 and the focal length is 5 mm. The number of feature points is limited to 500 with a bucketing strategy to obtain evenly distributed feature points. In addition, arbitrary RS distortion is added to the 2D feature points in every image. The RS distortion depends on the one-line exposure time $\tau$ of the RS camera and the instantaneous linear and angular velocities $\mathbf{v}_{rs},\mathbf{w}_{rs}$ in a frame. In our experiments, we fix $\tau$ to 50 $\mu s$ and vary $\mathbf{v}_{rs},\mathbf{w}_{rs}$ from 0 to [50,100] $(m/s,deg/s)$ , since $\tau$ and $\mathbf{v}_{rs},\mathbf{w}_{rs}$ are inversely correlated (the RS transformation depends only on their product). The RS distortion is generated at 6 levels ([0,0], [10,20], [20,40], [30,60], [40,80], [50,100] $(m/s,deg/s)$ ), and the estimates are evaluated with 100 Monte Carlo simulations. The errors are evaluated every two frames and averaged over the whole sequence. Rotation and translation errors are evaluated with the metrics defined as

$$\delta\mathbf{T}=\begin{bmatrix}\hat{\delta\mathbf{R}}&\delta\mathbf{t}\\ \mathbf{0}_{1\times 3}&1\end{bmatrix}=\mathbf{T}_{gt}\mathbf{T}_{est}^{-1},\quad\delta\theta=f(\hat{\delta\mathbf{R}}),\quad\delta t=\left\|\delta\mathbf{t}\right\|_{2},\tag{16}$$

where $f$ is Rodrigues’ rotation formula.
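The two error metrics can be computed as in the sketch below. It assumes that $\delta\theta$ is the rotation angle of the residual rotation (the magnitude of its axis-angle vector), obtained here from the matrix trace, and that poses are 4x4 rigid transforms stored as nested lists; `rigid_inverse`, `matmul4`, and `pose_errors` are hypothetical helper names.

```python
import math

def rigid_inverse(T):
    """Inverse of a 4x4 rigid transform [R t; 0 1], i.e. [R^T, -R^T t; 0 1]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    ti = [-sum(Rt[i][j] * t[j] for j in range(3)) for i in range(3)]
    return [Rt[0] + [ti[0]], Rt[1] + [ti[1]], Rt[2] + [ti[2]],
            [0.0, 0.0, 0.0, 1.0]]

def matmul4(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def pose_errors(T_gt, T_est):
    """Return (rotation error in degrees, translation error) of T_gt T_est^-1."""
    dT = matmul4(T_gt, rigid_inverse(T_est))
    trace = dT[0][0] + dT[1][1] + dT[2][2]
    # Clamp against round-off before taking the arc cosine.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    dtheta = math.degrees(math.acos(cos_angle))
    dt = math.sqrt(sum(dT[i][3] ** 2 for i in range(3)))
    return dtheta, dt

# Assumed example: ground truth is a 90 degree turn about z plus a translation.
c, s = math.cos(math.pi / 2), math.sin(math.pi / 2)
T_gt = [[c, -s, 0.0, 1.0],
        [s, c, 0.0, 2.0],
        [0.0, 0.0, 1.0, 3.0],
        [0.0, 0.0, 0.0, 1.0]]
T_identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

err_rot, err_trans = pose_errors(T_gt, T_identity)
zero_rot, zero_trans = pose_errors(T_gt, T_gt)
```

An estimate equal to the ground truth gives zero for both metrics, while the identity estimate in this example is off by the full 90 degrees and by the norm of the ground-truth translation.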

The conventional MVO and the proposed MRSVO estimate relative motion up to scale. Therefore, we utilize the ground-truth scale information for accurate evaluation. Furthermore, for the convenience of the comparison, we name the conventional MVO which estimates ego-motion with the essential matrix [16] simply as MVO.
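As a rough check on how strong the simulated distortion is, the synthetic settings can be turned into per-frame quantities. This worked example assumes that the 720 image rows are read out one after another with $\tau=50\ \mu s$ per row and that the velocities stay constant during the readout, as in the linear model of Sec. 4. The numbers are derived here and are not reported in the paper.

$$\left(n_{row}-1\right)\tau=719\times 50\ \mu s=35.95\ \mathrm{ms}$$

$$100\ \mathrm{deg/s}\times 0.03595\ \mathrm{s}\approx 3.6\ \mathrm{deg},\qquad 50\ \mathrm{m/s}\times 0.03595\ \mathrm{s}\approx 1.8\ \mathrm{m}$$

Under these assumptions, at level 6 the last image row is captured from a pose that differs from that of the first row by about 3.6 degrees and 1.8 m. The readout takes about 18% of the 200 ms between frames at 5 Hz.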

7.1.1 Comparison without feature tracking noise

To validate the effect of the proposed algorithm on RS distortion, we evaluate the MRSVO with true 2D feature correspondences. Figure 4a shows that the MRSVO outperforms the MVO. The rotation and translation errors of both the MVO and the MRSVO increase approximately linearly as the level of RS distortion increases, but the rate of increase of the MRSVO is much smaller than that of the MVO. More detailed results are given in Table 1a together with the average inlier ratio. At level 1 (no RS distortion), the inlier ratios of the MRSVO and the MVO are both 100 $\%$ when the number of tracked points is 500. As the level of RS distortion increases, the inlier ratio of the MVO decreases dramatically (down to 25.4 $\%$ ). In contrast, the inlier ratio of the MRSVO decreases only slightly (it stays at or above 97.2 $\%$ ), and the MRSVO, with its high inlier ratios, shows the lower estimation error. Consequently, we can use the inlier ratio to evaluate the performance of the MRSVO on the real dataset.

Even though the input feature points are noise-free, the high level of RS distortion degrades the MRSVO because the MRSVO exploits the estimate of the MVO as initial states. To handle this, we apply the RANSAC process to avoid bad initialization as described in Sec. 6. Figure 5 shows the performance of the MRSVO with RANSAC and without RANSAC. The effect of RANSAC is clearly shown in higher levels of RS distortion as expected.

7.1.2 Comparison with feature tracking noise

We evaluate the MRSVO and the MVO under Gaussian and Laplacian noise. The standard deviation of both types of noise is set to 1. The randomly generated noise is added to the positions of the 2D feature point correspondences. Figure 4b and Fig. 4c show that the MRSVO outperforms the MVO under both types of noise. The estimation errors show a tendency similar to the noise-free case. Tables 1b and 1c list the mean and standard deviation of the errors and the average inlier ratio. The inlier ratios of both the MRSVO and the MVO under Gaussian and Laplacian noise are lower than in the noise-free case. In the Gaussian noise case, the mean errors of the MRSVO are lower than those of the MVO at all levels of RS distortion, and the standard deviations are lower or comparable. Moreover, the MRSVO produces more accurate estimates and higher inlier ratios than the MVO at level 1 (no RS distortion) in the Gaussian noise case. This means that the MRSVO effectively suppresses the noise of the feature points as well as the RS distortion. In the Laplacian noise case, the MRSVO outperforms the MVO at the high levels (4-6) of RS distortion. However, the MVO provides slightly more accurate estimates than the MRSVO at the lowest levels of RS distortion (level 1, and the rotation error at level 2, in Table 1c), because the large number of outliers caused by the Laplacian noise sometimes leads to wrong convergence of the LM algorithm.

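The relative motion estimates from two consecutive frames are concatenated to construct the motion trajectory as

$$\mathbf{T}_{k+1}=\mathbf{T}_{k}\begin{bmatrix}\hat{\mathbf{R}}_{gs}&\hat{\mathbf{t}}_{gs}\\ \mathbf{0}_{1\times 3}&1\end{bmatrix}^{-1},\quad\mathbf{T}_{0}=\mathbf{I}_{4\times 4}.\tag{17}$$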
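The concatenation of relative motions can be sketched as a simple loop: starting from the identity, each pose is the previous pose multiplied by the inverse of the estimated relative transform. The function names and the example motions are hypothetical; scale is taken as given, as in the synthetic evaluation.

```python
def rigid_inverse(T):
    """Inverse of a 4x4 rigid transform [R t; 0 1], i.e. [R^T, -R^T t; 0 1]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    ti = [-sum(Rt[i][j] * t[j] for j in range(3)) for i in range(3)]
    return [Rt[0] + [ti[0]], Rt[1] + [ti[1]], Rt[2] + [ti[2]],
            [0.0, 0.0, 0.0, 1.0]]

def matmul4(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def accumulate_trajectory(relative_motions):
    """Chain (R, t) estimates into absolute poses: T_{k+1} = T_k [R t; 0 1]^-1."""
    T = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    poses = [T]
    for R, t in relative_motions:
        M = [list(R[i]) + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]
        T = matmul4(T, rigid_inverse(M))
        poses.append(T)
    return poses

# Assumed example: two identical steps without rotation.
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
poses = accumulate_trajectory([(I3, [0.0, 0.0, -1.0])] * 2)
```

Because each pose is built on the previous one, an error in any single relative estimate propagates to all later poses, which is why per-frame accuracy matters for the trajectories compared in the experiments.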

Figure 6 shows the camera trajectories for the six levels of RS distortion with Laplacian noise on the feature points. At high levels of distortion, the MRSVO produces more stable trajectories than the MVO.

7.2 Real dataset

We evaluate the MRSVO on a real dataset captured with commercial smartphone cameras. Here, we adopt the inlier ratio as the evaluation metric, since it is highly correlated with the estimation accuracy.

We compare the MRSVO to the MVO on Hedborg’s dataset [5]. The dataset contains 36 sequences, composed of 2 long and 34 short videos. The sequences were captured with an iPhone 4, which is equipped with an RS camera, and a Canon GS camera. We evaluate the MRSVO on 10 distinct sequences selected from the 36 partly redundant sequences. Table 2 lists the average inlier ratios of the MRSVO and the MVO. On the RS images, the MRSVO produces an inlier ratio about 15 percentage points higher than the MVO on average. With the GS camera, the inlier ratio of the MVO (MVO*) is about 5 percentage points higher than that of the MVO on the RS images. The inlier ratio of the MRSVO is still about 10 percentage points higher than that of the MVO on the GS images, because the MRSVO suppresses the feature noise as well as the RS distortion. Figure 7 shows the number of inliers in the test sequences. Overall, the MRSVO provides larger numbers of inliers than the MVO. Figure 8 compares the inliers of the MRSVO and the MVO on images from sequence 21. The inlier ratios annotated on the images clearly demonstrate the superiority of the MRSVO.
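The quoted gaps follow from the "All" column of Table 2 and are differences in percentage points:

$$62.9-47.2=15.7,\qquad 52.5-47.2=5.3,\qquad 62.9-52.5=10.4$$

The first difference compares the MRSVO with the MVO on the RS images, the second compares the MVO on the GS images (MVO*) with the MVO on the RS images, and the third compares the MRSVO on the RS images with MVO*.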

8 Conclusion

MVO with an RS camera suffers from undesirable artifacts when the camera moves quickly. To resolve this problem, we proposed a novel MVO algorithm that takes the geometric characteristics of the RS camera into account. The main idea is the joint estimation of the relative transformation and the instantaneous camera motion from two consecutive images. The proposed algorithm provides accurate ego-motion estimates in an online manner in the presence of severe RS distortion. The superiority of the proposed algorithm has been verified through extensive experiments on synthetic and real datasets. The results of the proposed algorithm can be used as an initial state for offline/time-delayed ego-motion estimation algorithms.

Bibliography

[1] Moravec, H.P.: Visual mapping by a robot rover. In: International Joint Conference on Artificial Intelligence. (1979)
[2] Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: CVPR. (2012)
[3] Davison, A.J., Reid, I.D., Molton, N.D., Stasse, O.: MonoSLAM: Real-time single camera SLAM. Pattern Analysis and Machine Intelligence, IEEE Transactions on 29 (6) (2007) 1052–1067
[4] Klein, G., Murray, D.: Parallel tracking and mapping on a camera phone. In: ISMAR. (2009)
[5] Hedborg, J., Forssén, P.E., Felsberg, M., Ringaby, E.: Rolling shutter bundle adjustment. In: CVPR. (2012)
[6] Karpenko, A., Jacobs, D., Baek, J., Levoy, M.: Digital video stabilization and rolling shutter correction using gyroscopes. Stanford University Computer Science Tech Report 1 (2011) 2
[7] Jia, C., Evans, B.L.: Probabilistic 3-D motion estimation for rolling shutter video rectification from visual and inertial measurements. In: IEEE 14th International Workshop on Multimedia Signal Processing (MMSP). (2012)
[8] Ait-Aider, O., Andreff, N., Lavest, J.M., Martinet, P.: Simultaneous object pose and velocity computation using a single view from a rolling shutter camera. In: ECCV. (2006)