Refine and Distill: Exploiting Cycle-Inconsistency and Knowledge Distillation for Unsupervised Monocular Depth Estimation

Andrea Pilzer; Stéphane Lathuilière; Nicu Sebe; Elisa Ricci

arXiv:1903.04202·cs.CV·April 23, 2019

TL;DR

This paper introduces a self-supervised monocular depth estimation model that leverages cycle-inconsistency and knowledge distillation, outperforming existing unsupervised methods on the KITTI benchmark.

Contribution

The novel framework combines cycle-inconsistency refinement with knowledge distillation to improve unsupervised depth estimation accuracy.

Findings

01 Outperforms state-of-the-art unsupervised methods on KITTI

02 Effective use of cycle-inconsistency for depth refinement

03 Knowledge distillation enhances model performance

Tables

Table 1: Ablation study on the KITTI dataset using the training and testing split proposed by Eigen et al. [5]. The upper part shows the results with the multiscale reconstruction $\mathcal{L}_{1}$ loss in [13], the bottom part with the $\mathcal{L}_{1}$ loss proposed in [12].

Method	Abs Rel	Sq Rel	RMSE	RMSE log	$\delta<1.25$	$\delta<1.25^{2}$	$\delta<1.25^{3}$
	lower is better	lower is better	lower is better	lower is better	higher is better	higher is better	higher is better
[13] $\mathcal{L}_{1}$ loss
HC	0.1487	1.2942	5.800	0.246	0.805	0.925	0.965
C	0.1451	1.2943	5.850	0.242	0.796	0.924	0.967
T feat	0.1220	1.0433	5.321	0.229	0.834	0.933	0.968
T disp	0.1234	1.0509	5.283	0.228	0.834	0.934	0.968
S feat	0.1438	1.2806	5.834	0.241	0.797	0.926	0.968
S disp	0.1438	1.2551	5.771	0.238	0.797	0.927	0.969
[12] $\mathcal{L}_{1}$ loss
T feat	0.1017	0.8930	4.768	0.206	0.878	0.946	0.972
T disp	0.0983	0.8306	4.656	0.202	0.882	0.948	0.973
S feat	0.1474	1.2416	5.849	0.241	0.788	0.923	0.968
S disp	0.1424	1.2306	5.785	0.239	0.795	0.924	0.968

Table 2: Ablation study where our two-network cycle is replaced by the single-network cycle from Yang et al. [40] (referred to as 1-CN).

Method	Abs Rel	Sq Rel	RMSE	RMSE log	$\delta<1.25$	$\delta<1.25^{2}$	$\delta<1.25^{3}$
	lower is better	lower is better	lower is better	lower is better	higher is better	higher is better	higher is better
1-CN C	0.1533	1.3326	5.837	0.240	0.785	0.919	0.967
1-CN S disp	0.1503	1.2622	5.868	0.243	0.783	0.918	0.967
Ours S disp	0.1438	1.2551	5.771	0.238	0.797	0.927	0.969
1-CN T disp	0.1478	1.3609	5.952	0.243	0.793	0.921	0.966
Ours T disp	0.1234	1.0509	5.283	0.228	0.834	0.934	0.968

Table 3: Ablation study on the Cityscapes dataset. The upper part shows the results with the multiscale reconstruction $\mathcal{L}_{1}$ loss in [13], the bottom part with the $\mathcal{L}_{1}$ loss proposed in [12].

Method	Abs Rel	Sq Rel	RMSE	RMSE log	$\delta<1.25$	$\delta<1.25^{2}$	$\delta<1.25^{3}$
	lower is better	lower is better	lower is better	lower is better	higher is better	higher is better	higher is better
[13] $\mathcal{L}_{1}$ loss
HC	0.4676	7.3992	5.741	0.493	0.735	0.890	0.945
C	0.4523	6.2604	5.381	0.557	0.736	0.888	0.946
T feat	0.4087	5.8777	4.394	0.334	0.846	0.940	0.967
T disp	0.3988	5.8752	4.293	0.316	0.848	0.941	0.968
S feat	0.4494	6.2599	5.343	0.421	0.739	0.891	0.947
S disp	0.4467	5.9012	5.297	0.473	0.736	0.890	0.946
[12] $\mathcal{L}_{1}$ loss
T feat	0.3878	5.8190	4.123	0.397	0.861	0.945	0.969
T disp	0.3846	6.2007	4.476	0.318	0.864	0.945	0.969
S feat	0.4455	6.2748	5.366	0.468	0.739	0.891	0.946
S disp	0.4305	5.9552	5.281	0.519	0.740	0.891	0.946

Table 4: Comparison with the state of the art. Training and testing are performed on the KITTI [10] dataset. Supervised and semi-supervised methods are marked with Y in the supervision (Sup.) column, unsupervised methods with N. Methods using a frame sequence as input and, thus, exploiting temporal information either at training or testing time, are marked with Y in the Video column. Numbers are obtained on the Eigen [5] test split with Garg [8] image cropping. Depth predictions are capped at the common threshold of 80 meters; if capped at 50 meters, we specify it. Best scores among static unsupervised methods are in bold. Best scores among other method categories are in italic.

Method	Sup.	Video	Abs Rel	Sq Rel	RMSE	RMSE log	$\delta<1.25$	$\delta<1.25^{2}$	$\delta<1.25^{3}$
			lower is better	lower is better	lower is better	lower is better	higher is better	higher is better	higher is better
Eigen et al. [5]	Y	N	0.190	1.515	7.156	0.270	0.692	0.899	0.967
Xu et al. [38]	Y	N	0.132	0.911	-	0.162	0.804	0.945	0.981
Jiang et al. [18]	Y	N	0.131	0.937	5.032	0.203	0.827	0.946	0.981
Gan et al. [7]	Y	N	0.098	0.666	3.933	0.173	0.890	0.964	0.985
Guo et al. [14]	Y	N	0.097	0.653	4.170	0.170	0.889	0.967	0.986
Yang et al. [40]	Y	Y	0.097	0.734	4.442	0.187	0.888	0.958	0.980
Zou et al. [45]	N	Y	0.150	1.124	5.507	0.223	0.806	0.933	0.973
Godard et al. [12]	N	Y	0.115	1.010	5.164	0.212	0.858	0.946	0.97
Zhou et al. [43]	N	N	0.208	1.768	6.856	0.283	0.678	0.885	0.957
Garg et al. [8]	N	N	0.169	1.08	5.104	0.273	0.740	0.904	0.962
Kundu et al. [19], 50m	N	N	0.203	1.734	6.251	0.284	0.687	0.899	0.958
Godard et al. [13]	N	N	0.148	1.344	5.927	0.247	0.803	0.922	0.964
Pilzer et al. [32]	N	N	0.152	1.388	6.016	0.247	0.789	0.918	0.965
Ours Student	N	N	0.1424	1.2306	5.785	0.239	0.795	0.924	0.968
Ours Teacher	N	N	0.0983	0.8306	4.656	0.202	0.882	0.948	0.973

Equations

$$\hat{\mathbf{I}}_{l}=f_{w}(\mathbf{d}_{l},\mathbf{I}_{r})$$

$$\hat{\mathbf{I}}_{r}=f_{w}(\mathbf{d}_{r},\hat{\mathbf{I}}_{l})$$

$$\mathbf{d}_{r}=G_{b}(\hat{\mathbf{I}}_{l})$$

$$\mathcal{I}_{r}=\mathbf{I}_{r}-\hat{\mathbf{I}}_{r}$$

$$\hat{\mathbf{I}}_{l}^{\prime}=f_{w}(\mathbf{d}_{l}^{\prime},\mathbf{I}_{r})$$

$$\mathcal{L}_{rec}^{(0)}=\lambda_{s}\left[\alpha\mathcal{L}_{SSIM}(\hat{\mathbf{I}}_{l},\mathbf{I}_{l})+(1-\alpha)\|\hat{\mathbf{I}}_{l}-\mathbf{I}_{l}\|_{1}\right]+\lambda_{b}\left[\alpha\mathcal{L}_{SSIM}(\hat{\mathbf{I}}_{r},\mathbf{I}_{r})+(1-\alpha)\|\hat{\mathbf{I}}_{r}-\mathbf{I}_{r}\|_{1}\right]+\lambda_{t}\left[\alpha\mathcal{L}_{SSIM}(\hat{\mathbf{I}}_{l}^{\prime},\mathbf{I}_{l})+(1-\alpha)\|\hat{\mathbf{I}}_{l}^{\prime}-\mathbf{I}_{l}\|_{1}\right]$$

$$\mathcal{L}_{rec}=\sum_{n=0}^{4}\mathcal{L}_{rec}^{(n)}$$

$$\mathcal{L}_{dist}=\|\mathbf{d}_{l}-\mathcal{S}(\mathbf{d}_{l}^{\prime})\|_{1}$$

$$\mathcal{L}_{dist}=\|\xi_{r}^{n}-\mathcal{S}(\xi_{r}^{\prime n})\|_{2}$$

$$\mathcal{L}_{tot}=\mathcal{L}_{rec}+\lambda_{dist}\mathcal{L}_{dist}$$


Taxonomy

Methods: Knowledge Distillation

Full text

Andrea Pilzer¹, Stéphane Lathuilière¹, Nicu Sebe¹,², and Elisa Ricci¹,³

¹DISI, University of Trento, via Sommarive 14, Povo (TN), Italy

²Huawei Technologies Ireland, Dublin, Ireland

³Technologies of Vision, Fondazione Bruno Kessler, via Sommarive 18, Povo (TN), Italy

{andrea.pilzer, stephane.lathuiliere, niculae.sebe, e.ricci}@unitn.it

Abstract

Nowadays, the majority of state-of-the-art monocular depth estimation techniques are based on supervised deep learning models. However, collecting RGB images with associated depth maps is a very time-consuming procedure. Therefore, recent works have proposed deep architectures that address the monocular depth prediction task as a reconstruction problem, thus avoiding the need to collect ground-truth depth. Following these works, we propose a novel self-supervised deep model for estimating depth maps. Our framework exploits two main strategies: refinement via cycle-inconsistency and distillation. Specifically, a student network is first trained to predict a disparity map so as to recover, from a frame in one camera view, the associated image in the opposite view. Then, a backward cycle network is applied to the generated image to re-synthesize the input image, estimating the opposite disparity. A third network exploits the inconsistency between the original and the reconstructed input frame in order to output a refined depth map. Finally, knowledge distillation is exploited to transfer information from the refinement network to the student. Our extensive experimental evaluation demonstrates the effectiveness of the proposed framework, which outperforms state-of-the-art unsupervised methods on the KITTI benchmark.

1 Introduction

In the last few years, deep learning-based approaches for depth estimation [5, 21, 25, 38, 13, 43, 28, 32] have attracted a growing interest, motivated, on the one hand, by their ability to predict very accurate depth maps and, on the other hand, by the importance of recovering depth information in several applications, such as robot navigation, autonomous driving, virtual reality and 3D reconstruction.

Exploiting the availability of very large annotated datasets, Convolutional Neural Networks (ConvNets) trained in a supervised setting are now state-of-the-art in many computer vision tasks such as object detection [11], instance segmentation [31] and human pose estimation [30]. However, a major weakness of these approaches is the need to collect large-scale labeled datasets. In the case of depth estimation, acquiring data is especially costly. For instance, in the scenario of depth estimation for autonomous driving, it implies driving a car equipped with a LiDAR scanner for hours under diverse lighting and weather conditions. Self-supervised depth estimation, also referred to as unsupervised, recently emerged as an interesting paradigm and an effective alternative to supervised methods [27, 8, 29, 13, 34]. Roughly speaking, in the self-supervised setting, stereo image pairs are considered as input and a deep predictor is learned in order to estimate the associated disparity maps. Specifically, the predicted disparity is employed to synthesize, from a frame in a camera view (e.g. from the left camera), the opposite view through warping. The deep network is trained via gradient descent by minimizing the discrepancy between the original and the reconstructed image. Importantly, even if stereo image pairs are required for training, depth can be recovered from a single image at test time.

In this paper, we follow this research thread and propose a novel self-supervised deep architecture for monocular depth estimation. The proposed approach, illustrated in Fig. 1, consists of a first sub-network, referred to as the student network, which receives as input an image from a camera view and predicts a disparity map so as to recover the opposite view. On top of this network, we propose several contributions. First, from the generated image, we propose to re-synthesize the input image by estimating the opposite disparity. The resulting network forms a cycle. Second, a third network exploits the cycle inconsistency between the original and the reconstructed input images in order to refine the estimated depth maps. Our intuition is that inconsistency maps provide rich information which can be further exploited, as they indicate where the first two networks fail to predict disparity pixels. Finally, we propose to use the principle of distillation in order to transfer knowledge from the whole network, seen as a teacher, to the student network. Interestingly, our framework produces two outputs, corresponding to the depth maps estimated respectively by the student and the teacher networks. This is extremely relevant in practical applications, as the student network can be exploited in case of low computation power or real-time constraints.

Our extensive experiments on two large publicly available datasets, i.e. the KITTI [9] and the Cityscapes [2] datasets, demonstrate the effectiveness of the proposed framework. Notably, by combining the proposed cycle structure with our inconsistency-aware refinement, our unsupervised framework outperforms previous unsupervised approaches, while obtaining results comparable with the state-of-the-art supervised methods on the KITTI dataset.

2 Related Work

In the last decade, deep learning models have greatly improved the performance of depth estimation methods. The vast majority of methods focus on a supervised setting and the problem of predicting depth maps is cast as a pixel-level regression problem [5, 25, 44, 22, 38, 40, 6]. The first ConvNet approach for monocular depth prediction was proposed in Eigen et al. [5], where the benefit of considering both local and global information was demonstrated. More recent works improved the performance of deep models by exploiting probabilistic graphical models implemented as neural networks [25, 36, 39, 38]. For instance, Wang et al. [36] proposed integrating hierarchical Conditional Random Fields (CRFs) into a ConvNet for joint depth estimation and semantic segmentation. Xu et al. [39, 38] exploited CRFs within a deep architecture in order to fuse information at multiple scales. However, supervised approaches rely on expensive ground-truth annotations and, consequently, lack flexibility for deployment in novel environments.

Recently, several works proposed to tackle the depth estimation problem within an unsupervised learning framework [20, 28, 34, 42, 32]. For instance, Garg et al. [8] attempted to learn depth maps in an indirect way. They used a ConvNet to predict the right-to-left disparity map from the left image and then reconstructed the right image according to the predicted disparity. They also introduced a network architecture operating on a coarse-to-fine principle, i.e. they employed an encoder-decoder network where the decoder first estimates a low resolution disparity map and then refines it in order to obtain a map at higher resolution. Improving upon [8], Godard et al. [13] proposed to use a single generative network to estimate both the left-to-right and the right-to-left disparity maps. Consistency between the two disparities was exploited in the form of a loss in order to better constrain the model. Other recent works demonstrated that temporal information and, in particular, considering multiple consecutive frames contributes to improving depth estimation [35, 41, 12, 43]. In particular, Zhou et al. [43] exploited temporal information to jointly learn the depth and the camera ego-motion from monocular sequences. Similarly, in [12], a deep network was designed in order to estimate both the depth and the camera pose from three consecutive frames. In this paper we focus on improving frame-level unsupervised depth estimation and we do not exploit any additional information such as supervision from related tasks (e.g. ego-motion estimation) or temporal consistency. In this respect, our work can be regarded as complementary to [43, 12].

The idea of exploiting cycle-consistency for depth estimation was recently investigated in [32]. Specifically, Pilzer et al. [32] introduced a deep architecture for stereo depth estimation which is organized in the form of a cycle: two sub-networks, corresponding to the two half-cycles, estimate respectively the left-to-right and right-to-left disparities. They also showed that cycle consistency, together with an adversarial loss, can greatly improve the quality of the predicted depth maps. The main difference with our proposal is that the architecture in [32] is designed for stereo depth estimation whereas we focus on the monocular setting. Moreover, contrary to [32], our architecture exploits cycle inconsistency both at training and at test time. Simultaneously, Tosi et al. [33] proposed disparity refinement and Yang et al. [40] proposed to compute the error maps between the original input images and their cycle-reconstructed versions and considered them as an additional input to a second network which produces refined depth estimates. In contrast to our approach, the deep model in [40] is trained using supervision derived from Stereo Direct Sparse Odometry [37]. Furthermore, to construct the cycle, we exploit a backward network and introduce a distillation loss.

Recently, knowledge distillation has attracted a lot of attention [16]. This methodology consists in compressing a large deep network (usually referred to as the teacher) into a much smaller model (the student) operating on the same modality. The student network is trained such that its outputs match those of the teacher. Knowledge distillation has been exploited for many computer vision tasks such as domain adaptation [15], object detection [1], learning from noisy labels [24] or facial analysis [26]. However, to the best of our knowledge, this work is the first attempt to exploit distillation for depth estimation. We claim that distillation is especially relevant for depth estimation since, in practical applications such as autonomous driving, real-time constraints may impose limitations in terms of network size. Note that we employ an unusual distillation scenario in which the student network is a sub-network of the teacher.

3 Proposed Method

3.1 Overview

The aim of this work is to estimate the depth of a scene from a single image. However, at training time, we assume access to pairs of images $\{\mathbf{I}_{l},\mathbf{I}_{r}\}$ of size $H\times W$, derived from a stereo pair and corresponding to the same time instant. Here, $\mathbf{I}_{l}$ denotes the left camera view and $\mathbf{I}_{r}$ is the right camera view. Given $\mathbf{I}_{r}$, we are interested in predicting a correspondence map $\mathbf{d}_{l}\in\mathbb{R}^{H\times W}$, namely the right-to-left disparity, in which each pixel value represents the offset of the corresponding pixel between the right and the left images. Finally, assuming that the images are rectified, the depth at a pixel location $(x,y)$ of the left image can be recovered from the predicted disparity as $\frac{f\cdot b}{\mathbf{d}_{l}(x,y)}$, where $b$ is the distance between the two cameras and $f$ is the camera focal length.
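As a minimal sketch of the disparity-to-depth conversion just described, the snippet below applies depth = f·b / disparity to a tiny disparity map. The function name, the `eps` guard against zero disparity, and the focal length and baseline values are assumptions made for illustration; they are not the calibration of any dataset used in the paper.

```python
def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Convert a disparity map (pixels) into a depth map (meters).

    disparity: list of rows, each a list of floats.
    focal_px: focal length f in pixels.
    baseline_m: distance b between the two cameras in meters.
    eps: hypothetical guard that avoids division by zero.
    """
    return [[focal_px * baseline_m / max(d, eps) for d in row]
            for row in disparity]


# Illustrative values only (not taken from the paper).
disp = [[10.0, 20.0], [40.0, 80.0]]
depth = disparity_to_depth(disp, focal_px=720.0, baseline_m=0.5)
print(depth)
```

Because depth is inversely proportional to disparity, halving the disparity doubles the estimated depth, so small disparity errors matter most for distant points.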

An overview of the proposed framework is shown in Fig. 2. A first network $G_{s}$ predicts the right-to-left disparity map $\mathbf{d}_{l}$ from the right image $\mathbf{I}_{r}$ , and synthesizes the left image by warping $\mathbf{I}_{r}$ according to $\mathbf{d}_{l}$ . Roughly speaking, the network $G_{s}$ is trained to minimize the discrepancy between the real and the reconstructed left image (Sec. 3.4).

We employ a second generator network $G_{b}$ that takes as input the synthesized left image and predicts a left-to-right disparity map $\mathbf{d}_{r}$ that is used to re-synthesize the right image. The model obtained in this way forms a cycle. This cycle design has three advantages. First, at training time, by sharing weights between $G_{s}$ and $G_{b}$, the networks learn to predict disparity maps from the images of the training set (in the forward half-cycle $G_{s}$) but also from the synthesized images (in the backward half-cycle $G_{b}$). In that sense, the use of the cycle can be seen as a sort of data augmentation. Second, in order to correctly re-synthesize the right image, the second network $G_{b}$ requires a correct input left image. Thus, $G_{b}$ imposes a global constraint on the estimated disparity $\mathbf{d}_{l}$, unlike standard pixel-wise discrepancy losses, such as $\mathcal{L}_{1}$ or $\mathcal{L}_{2}$, that act only locally. Third, by comparing the input right image $\mathbf{I}_{r}$ and the output right image $\hat{\mathbf{I}}_{r}$ synthesized after applying our cycle framework, we can measure the cycle inconsistency. At a given location of the input image, if we observe no inconsistency, $G_{s}$ and $G_{b}$ must have predicted the disparity maps correctly. Conversely, in case of inconsistency, $G_{s}$ or $G_{b}$ (or both) must have predicted the disparity maps incorrectly. Note that inconsistencies may also appear on object regions that are visible in only one of the two views. Interestingly, these regions are usually located on the object edges. Therefore, looking at cycle inconsistency also provides information about object edges that can help to predict better depth maps. Importantly, this inconsistency can be measured both at training and testing times, even though at testing time only the right image is available.

The main contribution of this work consists in exploiting the cycle inconsistency by training a third network $G_{i}$ in order to improve the prediction performance and output a refined depth map $\mathbf{d}_{l}^{\prime}$. In addition, since employing our inconsistency-aware network leads to more accurate depth predictions, we propose to use the disparity maps predicted by $G_{i}$ in order to improve the training of $G_{s}$ via a knowledge distillation approach.

Note that another possible cycle approach, as proposed in [40], would consist in using a single network to predict the two disparity maps. The two disparities can be used to obtain the synthesized left image and then the re-synthesized right image. Nevertheless, this approach has a major disadvantage with respect to ours: since only the warping operator is employed between the two synthesized images, the receptive field of $\hat{\mathbf{I}}_{r}$ in $\mathbf{I}_{l}$ is very small. In particular, when implementing the warping operator via bilinear sampling, the receptive field of the warping operator is only $2\times 2$. Therefore, the right image reconstruction loss can act on the reconstructed left image only locally. Conversely, our backward network $G_{b}$ imposes a global consistency on $\mathbf{d}_{l}$ thanks to its large receptive field.

The outputs of our method correspond to the estimated depth maps $\mathbf{d}_{l}$ and $\mathbf{d}_{l}^{\prime}$. While the estimated depth $\mathbf{d}_{l}^{\prime}$ corresponding to the teacher model is typically more accurate, in some applications, e.g. in resource-constrained settings, it could be convenient to exploit only a small student network.

In the following, we describe the design of our cycled network. Then, we introduce our novel inconsistency-aware network. Finally, we present the optimization objective including our proposed distillation approach.

3.2 Unsupervised Monocular Cycled Network

In this work, we adopt a setting in which the model is trained without the need of ground truth depth maps. This approach is often referred to as unsupervised or self-supervised depth estimation. Roughly speaking, it consists in training a network to predict a disparity map that can be used to generate the left image from the right image. Formally speaking, we employ a first network $G_{s}$ that takes as input the right image $\mathbf{I}_{r}$ and predicts the right-to-left disparity $\mathbf{d}_{l}$ . Following [13], we adopt a U-Net architecture for $G_{s}$ . We employ a warping function $f_{w}(\cdot)$ that synthesizes the left view image by sampling from $\mathbf{I}_{r}$ according to $\mathbf{d}_{l}$ :

$$\hat{\mathbf{I}}_{l}=f_{w}(\mathbf{d}_{l},\mathbf{I}_{r}).$$

Importantly, $f_{w}(\cdot)$ is implemented using the bilinear sampler from the spatial transformer network [17] resulting in a fully differentiable model. Consequently, the network can be trained via gradient descent by minimizing the discrepancy between $\hat{\mathbf{I}}_{l}$ and $\mathbf{I}_{l}$ (see Sec. 3.4 for details about network training).
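To illustrate how such a warping function can work, the sketch below re-samples each row of a source image at horizontally shifted positions. For rectified stereo pairs the shift is purely horizontal, so bilinear sampling reduces to linear interpolation along $x$. The sign convention (sampling at $x-d$), the border clamping and the function names are assumptions of this sketch rather than details confirmed by the paper.

```python
import math


def warp_row(src_row, disp_row):
    """Synthesize one row by sampling src_row at position x - disp_row[x].

    Uses linear interpolation between the two nearest source pixels and
    clamps the sampling position to the row borders.
    """
    w = len(src_row)
    out = []
    for x, d in enumerate(disp_row):
        pos = min(max(x - d, 0.0), w - 1.0)  # clamp to the valid range
        x0 = int(math.floor(pos))
        x1 = min(x0 + 1, w - 1)
        t = pos - x0                          # interpolation weight
        out.append((1.0 - t) * src_row[x0] + t * src_row[x1])
    return out


def warp(image, disparity):
    """Apply warp_row to every row of a single-channel image."""
    return [warp_row(r, d) for r, d in zip(image, disparity)]


right = [[0.0, 10.0, 20.0, 30.0]]
d_l = [[0.0, 0.5, 1.0, 1.5]]
left_hat = warp(right, d_l)
print(left_hat)  # [[0.0, 5.0, 10.0, 15.0]]
```

Each output pixel is a weighted sum of at most two neighbouring source pixels, and the weights vary continuously with the disparity, which is what makes the operation usable inside gradient-based training.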

Inspired by [32], we employ a second network $G_{b}$ in order to re-synthesize the right image according to:

$$\hat{\mathbf{I}}_{r}=f_{w}(\mathbf{d}_{r},\hat{\mathbf{I}}_{l}),$$

where:

$$\mathbf{d}_{r}=G_{b}(\hat{\mathbf{I}}_{l}).$$

The $G_{b}$ and $G_{s}$ networks share their encoder parameters. Note that, differently from the stereo depth model proposed in [32], our second half-cycle network takes only the synthesized left image as input. This crucial difference allows the use of this cycle in the monocular setting at testing time. Concerning the decoder networks, we adopt an architecture composed of a sequence of up-convolution layers in which the disparity is estimated and gradually refined from low to full resolutions similarly to [13]. We obtain the estimated left and the right disparity maps at each scale $\mathbf{d}_{l}^{n}$ and $\mathbf{d}_{r}^{n}$ , $n\in\{0,1,2,3\}$ , with sizes $[H/2^{n},W/2^{n}]$ . More precisely, $\mathbf{d}_{r}^{n}$ is computed from the decoder feature map $\mathbf{\xi_{r}}^{n}$ of size $[H/2^{n},W/2^{n}]$ via a convolutional layer. Then, $\mathbf{d}_{r}^{n}$ is concatenated with $\mathbf{\xi_{r}}^{n}$ obtaining a tensor that is input to an up-convolution layer in order to estimate the disparity at the next resolution $\mathbf{d}_{r}^{n-1}$ .

3.3 Inconsistency-Aware Network

We define the inconsistency tensor as the difference between the input image $\mathbf{I}_{r}$ and the image $\hat{\mathbf{I}}_{r}$ predicted by the backward network $G_{b}$ :

$$\mathcal{I}_{r}=\mathbf{I}_{r}-\hat{\mathbf{I}}_{r}$$

The proposed inconsistency-aware network $G_{i}$ takes as input the concatenation of $\mathbf{I}_{r}$ , $\mathcal{I}_{r}$ and $\mathbf{d}_{l}$ . We employ a network architecture similar to the half-cycle monocular network described in Sec. 3.2. However, we propose to provide to the encoder network the disparity maps $\mathbf{d}_{l}^{n},n\in\{1,2,3\}$ estimated by $G_{s}$ at each scale. More precisely, we concatenate along the channel axis each disparity $\mathbf{d}_{l}^{n}$ with network features of corresponding dimensionality.
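The input construction for $G_{i}$ can be sketched as follows: the inconsistency is the per-pixel, per-channel difference between the input and the re-synthesized right image, and it is stacked with the image and the disparity along the channel axis. The nested-list image layout (rows, pixels, channels) and the function names are assumptions chosen to keep the example dependency-free; the multi-scale disparities fed to the encoder are omitted.

```python
def inconsistency(i_r, i_r_hat):
    """Per-pixel, per-channel difference I_r - I_r_hat."""
    return [[[a - b for a, b in zip(p, q)] for p, q in zip(row, row_hat)]
            for row, row_hat in zip(i_r, i_r_hat)]


def concat_inputs(i_r, inc, d_l):
    """Concatenate image, inconsistency and disparity along the channel axis."""
    return [[list(p) + list(e) + [d] for p, e, d in zip(r_row, e_row, d_row)]
            for r_row, e_row, d_row in zip(i_r, inc, d_l)]


# A 1x2 RGB image, its cycle reconstruction and a 1x2 disparity map.
i_r = [[[0.5, 0.5, 0.5], [1.0, 0.0, 0.25]]]
i_r_hat = [[[0.5, 0.25, 0.5], [0.5, 0.0, 0.25]]]
d_l = [[2.0, 3.0]]

inc = inconsistency(i_r, i_r_hat)
x = concat_inputs(i_r, inc, d_l)
print(len(x[0][0]))  # 7 channels: 3 (image) + 3 (inconsistency) + 1 (disparity)
```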

The inconsistency-aware network $G_{i}$ estimates the right-to-left disparity $\mathbf{d}_{l}^{\prime}=G_{i}(\mathbf{I}_{r},\mathcal{I}_{r},\mathbf{d}_{l},\mathbf{d}_{l}^{\{1,2,3\}})$ and we reconstruct the left view image $\hat{\mathbf{I}_{l}}^{\prime}$ by applying the warping function $f_{w}$ :

$$\hat{\mathbf{I}}_{l}^{\prime}=f_{w}(\mathbf{d}_{l}^{\prime},\mathbf{I}_{r})$$

Similarly to $G_{s}$ and $G_{b}$ , $G_{i}$ estimates low resolution disparity maps $\mathbf{d_{l}^{\prime}}^{n},n\in\{1,2,3\}$ that are gradually refined from low to full resolutions.

3.4 Network Training and Knowledge Self-Distillation

In this section, we detail the losses employed to train the proposed network in an end-to-end fashion.

Reconstruction. First, we employ a reconstruction and structural similarity loss for each network. Following [13], we adopt the $\mathcal{L}_{1}$ loss to measure the discrepancy between the synthesized and the real images and the structural similarity loss $\mathcal{L}_{SSIM}$ to measure the discrepancy between the structure of the synthesized and the real images. By summing the losses of the three networks $G_{s}$, $G_{b}$ and $G_{i}$, we obtain:

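$$\mathcal{L}_{rec}^{(0)}=\lambda_{s}\left[\alpha\mathcal{L}_{SSIM}(\hat{\mathbf{I}}_{l},\mathbf{I}_{l})+(1-\alpha)\|\hat{\mathbf{I}}_{l}-\mathbf{I}_{l}\|_{1}\right]+\lambda_{b}\left[\alpha\mathcal{L}_{SSIM}(\hat{\mathbf{I}}_{r},\mathbf{I}_{r})+(1-\alpha)\|\hat{\mathbf{I}}_{r}-\mathbf{I}_{r}\|_{1}\right]+\lambda_{t}\left[\alpha\mathcal{L}_{SSIM}(\hat{\mathbf{I}}_{l}^{\prime},\mathbf{I}_{l})+(1-\alpha)\|\hat{\mathbf{I}}_{l}^{\prime}-\mathbf{I}_{l}\|_{1}\right]$$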

where $\lambda_{s}$ , $\lambda_{b}$ and $\lambda_{t}$ are adjustment parameters and $\alpha=0.85$ . Similarly, we also compute a reconstruction loss $\mathcal{L}_{rec}^{(n)}$ for the low resolution disparity maps. Following [12], we upsample the low resolution $\mathbf{d}_{l}^{n}$ , $\mathbf{d}_{r}^{n}$ and $\mathbf{d_{l}^{\prime}}^{n}$ to $H\times W$ and use the warping operator $f_{w}$ to re-synthesize full resolution images that are compared with the real images according to the $\mathcal{L}_{1}$ loss. The total reconstruction loss is:

$$\mathcal{L}_{rec}=\sum_{n=0}^{4}\mathcal{L}_{rec}^{(n)}$$
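The structure of the full-resolution reconstruction term, a weighted sum over the three networks of an SSIM part and an $\mathcal{L}_{1}$ part mixed by $\alpha=0.85$, can be sketched as below. Several simplifications are assumptions of this sketch and not the paper's implementation: images are flat lists of grayscale values, SSIM is computed over a single global window, $\mathcal{L}_{SSIM}$ is taken as $(1-\mathrm{SSIM})/2$, the $\mathcal{L}_{1}$ term is averaged over pixels, and the lower-resolution terms are omitted.

```python
def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM computed over one global window (a simplifying assumption)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den


def photometric(pred, target, alpha=0.85):
    """alpha * L_SSIM + (1 - alpha) * mean absolute error."""
    l_ssim = (1.0 - ssim_global(pred, target)) / 2.0
    l1 = sum(abs(a - b) for a, b in zip(pred, target)) / len(pred)
    return alpha * l_ssim + (1.0 - alpha) * l1


def rec_loss(i_l_hat, i_l, i_r_hat, i_r, i_l_hat_ref,
             lam_s=1.0, lam_b=0.1, lam_t=1.0):
    """Full-resolution reconstruction term summed over G_s, G_b and G_i."""
    return (lam_s * photometric(i_l_hat, i_l)
            + lam_b * photometric(i_r_hat, i_r)
            + lam_t * photometric(i_l_hat_ref, i_l))


i_l = [0.0, 0.5, 1.0]
i_r = [0.1, 0.6, 0.9]
perfect = rec_loss(i_l, i_l, i_r, i_r, i_l)
flipped = rec_loss([1.0, 0.5, 0.0], i_l, i_r, i_r, i_l)
print(perfect, flipped)
```

A perfect reconstruction gives a loss of (numerically) zero, while the mirrored student prediction in `flipped` is penalised by both the SSIM and the $\mathcal{L}_{1}$ part.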

Self-Distillation. Finally, we propose to introduce a knowledge distillation loss. As detailed in the experimental section (Sec. 4), the inconsistency-aware network outperforms the simple half-cycle network $G_{s}$ by a significant margin. This boost comes at the cost of a higher computational complexity. The idea of the proposed self-distillation loss consists in distilling knowledge from the inconsistency-aware network to the half-cycle network $G_{s}$. Thus, we improve the performance of $G_{s}$ without adding any computational complexity at testing time. To do so, we evaluate disparity and feature distillation. For the first, we impose that the network $G_{s}$ predicts disparity maps similar to the output of the inconsistency-aware network. It can be seen as a distillation approach where $G_{s}$ plays the role of the student and the whole network (composed of $G_{s}$, $G_{b}$ and $G_{i}$) is the teacher. However, in our particular case, the student network is a sub-network of the teacher. From this perspective, we name this approach self-distillation. The self-distillation loss is given by:

$$\mathcal{L}_{dist}=\|\mathbf{d}_{l}-\mathcal{S}(\mathbf{d}_{l}^{\prime})\|_{1}$$

where $\mathcal{S}$ denotes the stop-gradient operation. In particular, the stop-gradient operation equals the identity function in the forward pass of the back-propagation algorithm, but it has a null gradient in the backward pass. The purpose of the stop-gradient is to prevent $\mathbf{d}_{l}^{\prime}$ from converging to $\mathbf{d}_{l}$. On the contrary, the goal is to help $\mathbf{d}_{l}$ become as accurate as $\mathbf{d}_{l}^{\prime}$.
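The effect of the stop-gradient can be made explicit by writing out the gradients of the disparity distillation loss $\mathcal{L}_{dist}=\|\mathbf{d}_{l}-\mathcal{S}(\mathbf{d}_{l}^{\prime})\|_{1}$. This is a worked illustration, assuming the usual convention that the derivative of the absolute value is the sign function away from zero; the pixel index $i$ is notation introduced here:

$$\frac{\partial\mathcal{L}_{dist}}{\partial\mathbf{d}_{l}(i)}=\operatorname{sign}\big(\mathbf{d}_{l}(i)-\mathbf{d}_{l}^{\prime}(i)\big),\qquad\frac{\partial\mathcal{L}_{dist}}{\partial\mathbf{d}_{l}^{\prime}(i)}=0.$$

Without $\mathcal{S}$, the second derivative would instead equal $-\operatorname{sign}\big(\mathbf{d}_{l}(i)-\mathbf{d}_{l}^{\prime}(i)\big)$, so the teacher output would also be pulled toward the student prediction.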

For the second, we impose that the decoder features $\xi_{r}^{\prime n},n\in\{0,1,2\}$ of the teacher are similar to the features $\xi_{r}^{n}$ of the student. The feature self-distillation loss is given by:

$$\mathcal{L}_{dist}=\|\xi_{r}^{n}-\mathcal{S}(\xi_{r}^{\prime n})\|_{2}$$

The total training loss is given by:

$$\mathcal{L}_{tot}=\mathcal{L}_{rec}+\lambda_{dist}\mathcal{L}_{dist}$$

4 Experiments

We evaluate our proposed approach on two publicly available datasets and compare its performance with state of the art methods.

4.1 Experimental Setup

Datasets. We perform experiments on two large stereo image datasets, i.e. KITTI [10] and Cityscapes [3]. Both datasets are recorded from driving vehicles. Concerning the KITTI dataset, we employ the training and test split of Eigen et al. [5]. This split is composed of 22,600 training image pairs and 697 test pairs. We consider data augmentation with online random flipping of the images during training as in [13]. For Cityscapes, images were collected with higher resolution. To train our model we combine images from the densely and coarsely annotated splits to obtain 22,973 image pairs as in [32]. The test split is composed of 1,525 image pairs of the densely annotated split. The evaluation is performed using the pre-computed disparity maps.

Evaluation Metrics. The quantitative evaluation is performed according to several standard metrics used in previous works [5, 13, 36]. Let $P$ be the total number of pixels in the test set and $\hat{d}_{i}$ , $d_{i}$ the estimated depth and ground truth depth values for pixel $i$ . We compute the following metrics:

• Mean relative error (abs rel): $\frac{1}{P}\sum_{i=1}^{P}\frac{\lvert\hat{d}_{i}-d_{i}\rvert}{d_{i}}$,

• Squared relative error (sq rel): $\frac{1}{P}\sum_{i=1}^{P}\frac{\lvert\hat{d}_{i}-d_{i}\rvert^{2}}{d_{i}}$,

• Root mean squared error (rmse): $\sqrt{\frac{1}{P}\sum_{i=1}^{P}(\hat{d}_{i}-d_{i})^{2}}$,

• Root mean squared log error (rmse log): $\sqrt{\frac{1}{P}\sum_{i=1}^{P}\lvert\log\hat{d}_{i}-\log d_{i}\rvert^{2}}$,

• Accuracy with threshold $\tau$, i.e. the percentage of $\hat{d}_{i}$ such that $\delta=\max(\frac{d_{i}}{\hat{d}_{i}},\frac{\hat{d}_{i}}{d_{i}})<\alpha^{\tau}$. We employ $\alpha=1.25$ and $\tau\in\{1,2,3\}$ following [5].
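The metrics listed above can be computed with a few lines of code. The sketch below is a plain-Python illustration on flat lists of depths; the function name and the returned dictionary keys are choices made here, the natural logarithm is assumed for the log metric, and no depth capping or image cropping is applied.

```python
import math


def depth_metrics(pred, gt):
    """Compute the evaluation metrics for flat lists of positive depths."""
    p = len(gt)
    pairs = list(zip(pred, gt))
    abs_rel = sum(abs(a - b) / b for a, b in pairs) / p
    sq_rel = sum((a - b) ** 2 / b for a, b in pairs) / p
    rmse = math.sqrt(sum((a - b) ** 2 for a, b in pairs) / p)
    rmse_log = math.sqrt(
        sum((math.log(a) - math.log(b)) ** 2 for a, b in pairs) / p)
    # delta = max(gt / pred, pred / gt), thresholded at 1.25 ** tau.
    ratios = [max(a / b, b / a) for a, b in pairs]
    acc = {tau: sum(r < 1.25 ** tau for r in ratios) / p for tau in (1, 2, 3)}
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse,
            "rmse_log": rmse_log, "acc": acc}


# Toy example with four pixels (values chosen for easy hand-checking).
pred = [10.0, 20.0, 30.0, 8.0]
gt = [10.0, 16.0, 20.0, 16.0]
m = depth_metrics(pred, gt)
print(m["abs_rel"], m["sq_rel"], m["acc"])
```

In this toy example the second pixel has a ratio of exactly 1.25, so it does not count toward the first accuracy threshold because the inequality is strict.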

4.2 Baselines for Ablation.

To perform the ablation study presented in Sec. 4.3, we consider the following baselines:

•

half-cycle: our basic building block, uses the forward branch that takes $\mathbf{I}_{r}$ as input and generates $\mathbf{d}_{l}$ to reconstruct the other stereo view $\mathbf{\hat{I}}_{l}$ . Neither cycle-consistency nor self-distillation are used in this model.

•

cycle: a backward network is added to the half-cycle model in order to reconstruct $\mathbf{\hat{I}}_{r}$ from the estimated $\mathbf{\hat{I}}_{l}$ . Note that the backward network is used only at training time. At test time, the output is the same as for the half-cycle model.

•

teacher: we stack the inconsistency-aware network after the cycle, as described in Sec. 3.3.

•

student: the output of the inconsistency-aware network is distilled in order to refine the first half-cycle. At test time, the output and the computation complexity are the same as in the half-cycle model.

In Tables 1, 2 and 3 we indicate with HC, C, T and S, the half-cycle, cycle, teacher and student respectively; feat and disp denote self-distillations of features and disparities.

Training Procedure. The whole network is trained following an iterative procedure. First, we train the forward half-cycle network for 10 epochs. In a second step, we train the backward network decoder for 5 epochs without updating the first half-cycle network. The whole cycle is then jointly trained for a further 10 epochs. Then, the inconsistency-aware module is pretrained for 5 epochs. Finally, the whole network is jointly fine-tuned for 10 epochs.
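The staged procedure above can be written down as a small schedule table, which makes the epoch budget explicit. This is only an illustrative sketch: the `Stage` class, the stage names and the helper functions are hypothetical, and only the epoch counts and the order of the stages come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str    # hypothetical identifier for the stage
    epochs: int  # epoch count stated in the text
    note: str    # what is trained during the stage


SCHEDULE = [
    Stage("half_cycle", 10, "train the forward half-cycle network"),
    Stage("backward_decoder", 5, "train the backward decoder, half-cycle not updated"),
    Stage("cycle_joint", 10, "train the whole cycle jointly"),
    Stage("inconsistency_pretrain", 5, "pretrain the inconsistency-aware module"),
    Stage("full_finetune", 10, "fine-tune the whole network jointly"),
]


def total_epochs(schedule):
    """Sum the epochs over all stages."""
    return sum(stage.epochs for stage in schedule)


def epoch_ranges(schedule):
    """Yield (stage name, first epoch, last epoch), counting epochs from 1."""
    start = 1
    for stage in schedule:
        end = start + stage.epochs - 1
        yield stage.name, start, end
        start = end + 1


if __name__ == "__main__":
    for name, first, last in epoch_ranges(SCHEDULE):
        print(f"epochs {first:2d}-{last:2d}: {name}")
    print("total epochs:", total_epochs(SCHEDULE))
```

Under this reading, the full procedure spans 40 epochs, of which the last 10 involve all modules at once.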

Parameters. The model is implemented with the deep learning library TensorFlow. Similarly to [13], the input images are down-sampled to a resolution of $512\times 256$ from the original sizes which are $1226\times 370$ for the KITTI dataset and for CityScapes. In all our experiments we use a batch size equal to $8$ stereo image pairs and the Adam optimizer with learning rate set to $10^{-5}$ .

The half-cycle and cycle networks are trained with the following loss parameters: $\lambda_{s}=1$ , $\lambda_{b}=0.1$ and $\lambda_{t}=0$ . When training the teacher network we use $\lambda_{s}=0$ , $\lambda_{b}=0$ and $\lambda_{t}=1$ . We weight the distillation loss $\mathcal{L}_{dist}$ with $\lambda_{dist}=0.005$ if feature distillation is applied and with $\lambda_{dist}=0.1$ if disparity distillation is applied. The joint training of the full network is done with learning rate $l_{r}=10^{-5}$ , loss parameters $\lambda_{s}=1$ , $\lambda_{b}=0.1$ , $\lambda_{t}=1$ , and $\lambda_{dist}$ equal to $0.005$ in the case of feature distillation and $0.1$ in the case of disparity distillation.
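To make the weight settings above easier to compare, the following sketch collects them in one place and combines the loss terms. It assumes that the overall objective is a plain weighted sum of the individual terms; that form, the function `total_loss`, and the dictionary layout are assumptions for illustration, while the numeric weights are the ones quoted above.

```python
# Weights (lambda_s, lambda_b, lambda_t) for each training phase, as quoted above.
STAGE_WEIGHTS = {
    "half_cycle_and_cycle": {"s": 1.0, "b": 0.1, "t": 0.0},
    "teacher": {"s": 0.0, "b": 0.0, "t": 1.0},
    "joint": {"s": 1.0, "b": 0.1, "t": 1.0},
}

# lambda_dist depends on what is distilled: features or disparities.
DIST_WEIGHT = {"feat": 0.005, "disp": 0.1}


def total_loss(stage, terms, distillation=None):
    """Weighted sum of loss terms (assumed form of the objective).

    stage: key of STAGE_WEIGHTS.
    terms: dict with the loss values under the keys "s", "b", "t", "dist".
    distillation: None, "feat" or "disp".
    """
    weights = dict(STAGE_WEIGHTS[stage])
    weights["dist"] = DIST_WEIGHT[distillation] if distillation else 0.0
    return sum(w * terms.get(key, 0.0) for key, w in weights.items())


if __name__ == "__main__":
    unit_terms = {"s": 1.0, "b": 1.0, "t": 1.0, "dist": 1.0}
    for kind in ("feat", "disp"):
        print(kind, total_loss("joint", unit_terms, distillation=kind))
```

With all loss terms set to one, the printed values simply expose the relative weight of each configuration, which shows how small the distillation term is compared with the reconstruction terms.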

4.3 Results

Ablation Study. To demonstrate the validity of the proposed contributions we first conduct an ablation study on the KITTI dataset [10] and the CityScapes dataset [3]. Results are shown in Table 1 and Table 3, respectively.

We split the ablation into two parts, each employing a different reconstruction loss variant. For the first part, as in [13], we use a multi-scale reconstruction loss where the smaller-scale reconstruction is compared with a downsampled version of the stereo image. In contrast, for the second part, we employ a more effective reconstruction loss, upsampling all the disparities to the input scale before warping, as described in Sec. 3.4.

In Table 1 it is interesting to note that constraining the monocular student network with the cycled design improves several of the metrics over the simple forward branch, without requiring additional losses. This comes at the cost of doubling the forward propagation time at training time, but not at test time. Moreover, the monocular cycled structure has the major advantage of automatically computing the inconsistency of the reconstruction at both training and test time. Therefore, stacking a network that is aware of the inconsistencies and of the previous estimations, the teacher network, improves the performance. We observe that our proposed inconsistency-aware network brings a substantial improvement that is consistent over all the metrics, e.g. $14\%$ and $18\%$ in Abs Rel and Sq Rel, respectively, when comparing cycle and teacher.

Student-teacher distillation leads to a consistent improvement over all metrics, demonstrating that self-distillation improves the student while keeping the performance of the teacher constant. Regarding the two distillation strategies, we found that the network trained with disparity distillation converges faster than the one trained with feature distillation. This is not unexpected, given the much more compact size of the disparity compared to the several channels of the features.

To demonstrate the validity of the design of our cycle network, we perform an ablation study where our two-network cycle structure is replaced by the single-network cycle proposed by Yang et al. [40]. In this experiment, we use our proposed inconsistency-aware module to exploit the inconsistency estimated by the single-network cycle in [40]. Contrary to [40], we trained the models without supervision in order to compare the two different approaches in the unsupervised setting. We use the $\mathcal{L}_{1}$ loss from [13] for a fair comparison. Results are reported in Table 2. We observe that the inconsistency estimates obtained with the single-network cycle of [40] are associated with worse performance than those of our method.

We also performed an ablation study on the Cityscapes dataset in Table 3, following the evaluation procedure proposed in [32]. The results confirm the trends observed on KITTI. The cycle network improves over the half-cycle in five metrics out of seven. The teacher, effectively exploiting inconsistencies, is associated with an improvement on all error metrics (ranging from $7\%$ to $20\%$ ). Distillation further provides a boost in performance of about $1.5\%$ to $5\%$ . In the second part of the ablation study, the teacher further improves its estimations, gaining over $20\%$ over the initial cycle setting. More interesting is the gain in performance of the student, which improves by $2\%$ to $5\%$ .

In Fig. 4, we present qualitative results for Cityscapes. The half-cycle and cycle images are smooth and do not present artifacts. The teacher provides more accurate depth maps with sharper edges for small objects and better background estimations (*e.g.* third row, people in the back). After distillation, the student also inherits this ability and we observe more detailed predictions compared to the original cycle.

4.4 Comparison with State-of-the-Art

In Table 4 we compare with several state-of-the-art works, considering both supervised learning-based (Eigen et al. [5], Xu et al. [38], Jiang et al. [18], Gan et al. [7], Guo et al. [14], Yang et al. [40]) and unsupervised learning-based (Zhou et al. [43], Garg et al. [8], Kundu et al. [19], Godard et al. [13], Pilzer et al. [32], Godard et al. [12] and Zou et al. [45]) methods.

The teacher network reaches state-of-the-art performance for the frame-level unsupervised setting, even improving over state-of-the-art methods that use depth supervision such as [38], and is competitive with those using depth and video cues [7, 14, 40]. Note that Yang et al. [40] consider a setting similar to ours, proposing to use errors to refine the depth estimation with a stacked network. Our method has several advantages though: it is unsupervised, it does not consider multiple video frames, and it avoids the use of several losses whose hyper-parameters are hard to tune. Furthermore, as demonstrated by our experiments in Table 2, our approach adopts a more effective network structure for computing cycle inconsistencies. The student network, after distillation, improves on unsupervised approaches with similar network capacity like [8, 13, 32] and is only outperformed by previous unsupervised methods that exploit additional information during training like [12].

Qualitative results in Figure 3 show that our model predicts challenging areas more accurately, i.e. sky, trees in the background and shadowed areas that are difficult to interpret, compared to competitive unsupervised models [8, 13, 32]. Note that small details are better reconstructed by [13] but, overall, our estimations look smoother and have fewer large errors, such as on the train windshield in row seven.

5 Conclusions

We proposed a monocular depth estimation network which computes the inconsistencies between input and cycle-reconstructed images and exploits them to generate state-of-the-art depth predictions through a refinement network. We showed that distillation is an effective paradigm for depth estimation and improves the student network performance by transferring information from the refinement network. In future work we plan to further improve the distillation process by accounting for teacher and student confidence in the estimates. In this way we expect to better guide the learning process and to correct prediction inconsistencies more effectively.

6 Acknowledgement

We want to thank the NVIDIA Corporation for the donation of the GPUs used in this project.

Appendix

We report some implementation details and further experimental results. Note that qualitative results are also reported in the video file attached to this document.

Appendix A Training Details

In all our experiments, we use a learning rate equal to 1e-5 and batches composed of $8$ stereo image pairs. We employ the Adam optimizer, with momentum parameter and the weight decay set to $0.9$ and 2e-5, respectively. We used an NVIDIA Titan Xp with 12 GB of memory.

Analysis of Time Aspect. The initial training of the half-cycle for $10$ epochs takes approximately $2.7$ hours, and training the backward-cycle decoder for $5$ epochs takes $1$ hour. Joint training of the cycle requires $3.5$ hours for $10$ epochs. Then, pretraining the inconsistency-aware network takes $2$ hours for $5$ epochs. Finally, the joint fine-tuning with self-distillation for $10$ epochs requires about $6.5$ hours.
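As a quick check on these figures, the stage durations can be added up. This is a worked sum over the approximate values quoted above, not an additional measurement:

$$2.7 + 1 + 3.5 + 2 + 6.5 = 15.7\ \text{hours}, \qquad 10 + 5 + 10 + 5 + 10 = 40\ \text{epochs},$$

which corresponds to an average of roughly $15.7 / 40 \approx 0.39$ hours per epoch. The final joint fine-tuning stage alone accounts for about $6.5 / 15.7 \approx 41\%$ of the total training time.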

At test time, depending on time constraints, either the student or the teacher network can be used. The student takes $25$ ms, while the teacher, which requires propagation through the full network, takes $48.5$ ms.
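These latencies can be converted into approximate throughput, assuming that images are processed one at a time with no batching or pipelining (an assumption, since the text does not state how the timings were measured):

$$\frac{1000\ \text{ms}}{25\ \text{ms}} = 40\ \text{images/s}, \qquad \frac{1000\ \text{ms}}{48.5\ \text{ms}} \approx 20.6\ \text{images/s}.$$

Under this assumption, the student runs about $48.5 / 25 \approx 1.9$ times faster than the teacher.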

Appendix B Experimental Results

In this section, we present additional qualitative results, an ablation study of our proposed method on KITTI dataset [10], and visualizations of the inconsistency.

In Fig. 5, we report a qualitative ablation study on the KITTI dataset. These results are consistent with the qualitative ablation study on Cityscapes and with the quantitative ablation on KITTI, both reported in the main paper. Indeed, we first observe that our teacher network better estimates the scene details, *e.g.* rows 1, 3, 4, 6 and 8, where the image contains many trees and cars. For instance, in the first row, the depth of the bicycle is not correctly estimated by our half-cycle. The image in row 4 is a particularly interesting example, since it is challenging due to the presence of many vehicles. Again, we observe that our inconsistency-aware network (referred to as teacher) predicts better depth maps.

In order to further analyze the performance of our model, in Fig. 6 we compare the inconsistency tensor, estimated by the cycle network, with the reconstruction errors of the student and teacher networks. First, we observe that the inconsistency tensors, column $4$ , are very similar to the reconstruction errors of the student, column $3$ . This shows that our cycle approach is able to correctly estimate the location of the errors in the student predictions. Second, most of the errors are located on the object edges. This confirms that the cycle inconsistency can provide information about edge location. Third, comparing the reconstruction errors of the student, column $3$ , with the teacher’s reconstruction errors, column $6$ , we observe that the teacher’s error maps contain far fewer large errors. For instance, in row 4, the student network generates large errors on the edges of the car in the image center. Those errors are also visible in the inconsistency maps but are much smaller in the teacher prediction. This better estimation of the car edges can also be observed by comparing the depth maps predicted by the student and the teacher. In row 7 and in the last two rows, the student network generates errors on the dashed lines on the road. These errors are also visible in the inconsistency tensors but are substantially reduced in the teacher predictions. These examples clearly illustrate the benefit of our inconsistency-aware network. Finally, in rows 1, 2, 3, 5, 9, 10, 12 and 15, we note that the student generates many errors when the input image contains trees. The teacher predictions are consistently better in the image regions containing trees.

Bibliography

[1] Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker. Learning efficient object detection models with knowledge distillation. In NIPS, 2017.
[2] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[3] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[4] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, 2015.
[5] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014.
[6] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In CVPR, 2018.
[7] Yukang Gan, Xiangyu Xu, Wenxiu Sun, and Liang Lin. Monocular depth estimation with affinity, vertical pooling, and label enhancement. In ECCV, 2018.
[8] Ravi Garg, Vijay Kumar BG, Gustavo Carneiro, and Ian Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In ECCV, 2016.