Learning to Reconstruct and Understand Indoor Scenes from Sparse Views

Jingyu Yang; Ji Xu; Kun Li; Yu-Kun Lai; Huanjing Yue; Jianzhi Lu; Hao Wu; Yebin Liu

arXiv:1906.07892·cs.CV·June 20, 2019


TL;DR

This paper introduces a novel method for 3D indoor scene reconstruction and semantic segmentation from only a few uncalibrated color images, using an iterative deep architecture and a joint registration approach.

Contribution

It presents a new approach that reconstructs detailed 3D scenes and segments semantics from sparse, uncalibrated views, reducing data acquisition complexity.

Findings

01

Achieves more accurate depth estimation than existing methods.

02

Provides smaller semantic segmentation errors.

03

Produces better 3D reconstruction results.

Abstract

This paper proposes a new method for simultaneous 3D reconstruction and semantic segmentation of indoor scenes. Unlike existing methods that require recording a video using a color camera and/or a depth camera, our method only needs a small number of (e.g., 3-5) color images from uncalibrated sparse views as input, which greatly simplifies data acquisition and extends applicable scenarios. Since different views have limited overlaps, our method allows a single image as input to discern the depth and semantic information of the scene. The key issue is how to recover relatively accurate depth from single images and reconstruct a 3D scene by fusing very few depth maps. To address this problem, we first design an iterative deep architecture, IterNet, that estimates depth and semantic segmentation alternately, so that they benefit each other. To deal with the little overlap and non-rigid…

Tables

Table 1. Table I: Comparison between various indoor datasets. IterNet RGB-D is our proposed dataset. ×: not included, ✓: included, -: relevant information not available.

Dataset	NYUv2 [39]	SUN RGB-D [48]	Building Parser [2]	Matterport3D [4]	ScanNet [7]	SUNCG [49]	SceneNet RGB-D [17]	IterNet RGB-D
Year	2012	2015	2017	2017	2017	2016	2016	2019
Type	Real	Real	Real	Real	Real	Synthetic	Synthetic	Synthetic
Images/Scans	1449	10K	70K	194K	1513	130K	5M	12,856
Layouts	464	-	270	90	1513	45,622	57	3214
Object Classes	894	800	13	40	≥50	84	255	333
RGB	✓, ×, ×, ✓ (merged cells in source; column spans not recoverable)
Depth	✓, ×, ✓ (merged cells in source; column spans not recoverable)
Semantic Label	✓ (merged cell in source)
RGB Texturing	Real	Real	Real	Real	Real	Not Photorealistic	Photorealistic	Photorealistic
Image Resolution	640×480	640×480	1080×1080	1280×1024	640×480	640×480	320×240	1280×960; 1280×720

Table 2. Table II: Ablation study on our dataset. F-S: full model without semantic; F-D: full model without depth; F: full model.

Method	F-S	F-D	F
rel (lower is better)	0.176	-	0.136
log10 (lower is better)	0.088	-	0.062
rms (lower is better)	1.012	-	0.507
P-acc.(%) (higher is better)	-	67.35	75.54
M-acc.(%) (higher is better)	-	68.29	74.49
IoU(%) (higher is better)	-	54.21	63.98

Table 3. Table III: Quantitative evaluation for depth estimation on NYUv2 dataset.

Method	Error (lower is better)			Accuracy (higher is better)
Method	rel	log10	rms	$δ < 1.25$	$δ < {1.25}^{2}$	$δ < {1.25}^{3}$
Saxena et al.[44]	0.349	-	1.214	0.447	0.745	0.897
Liu et al.[32]	0.335	0.127	1.06	-	-	-
Karsch et al.[23]	0.35	0.131	1.20	-	-	-
Ladicky et al.[25]	-	-	-	0.542	0.829	0.941
Zhou et al.[58]	0.305	0.122	1.04	0.525	0.838	0.962
Liu et al.[31]	0.213	0.087	0.759	0.650	0.906	0.976
Roi and Todorovic [42]	0.187	0.078	0.744	-	-	-
Eigen et al.[13]	0.215	-	0.907	0.611	0.887	0.971
Eigen and Fergus [12]	0.158	-	0.641	0.769	0.950	0.988
Laina et al.[26]	0.129	0.056	0.583	0.801	0.950	0.986
Xu et al.[54]	0.139	0.063	0.609	0.793	0.948	0.984
Xu and Wang [55]	0.121	0.052	0.586	0.811	0.954	0.987
Joint HCRF [51]	0.220	0.094	0.745	0.605	0.890	0.970
Jafari et al.[21]	0.157	0.068	0.673	0.762	0.948	0.988
PAD-Net [53]	0.120	0.055	0.582	0.817	0.954	0.987
Ours	0.122	0.051	0.582	0.819	0.953	0.988

Table 4. Table IV: Quantitative evaluation for depth estimation on our dataset.

Method	Error (lower is better)			Accuracy (higher is better)
Method	rel	log10	rms	$δ < 1.15$	$δ < {1.15}^{2}$	$δ < {1.15}^{3}$
Eigen et al.[13]	0.948	0.285	4.711	0.054	0.205	0.492
Laina et al.[26]	0.404	0.235	3.433	0.102	0.310	0.581
Xu et al.[54]	0.175	0.089	1.010	0.435	0.700	0.907
Xu and Wang [55]	0.151	0.067	0.620	0.536	0.817	0.975
Ours	0.136	0.062	0.507	0.568	0.918	0.982

Table 5. Table V: Quantitative evaluation for semantic segmentation on the NYUv2-40 dataset.

Method	Pixel Accuracy	Mean Accuracy	IoU
Deng et al.[10]	63.8	31.5	-
FCN [35]	60.0	42.2	29.2
FCN-HHA [35]	65.4	46.1	34.0
Eigen et al.[12]	65.6	45.1	34.1
Lin et al.[29]	70.0	53.6	40.6
RefineNet[28]	73.6	58.9	46.5
Kong et al.[24]	72.1	-	44.5
Saxena et al.[44]	-	55.7	43.1
Gupta et al.[16]	60.3	-	28.6
Mousavian et al.[37]	68.6	52.3	39.2
Ours	74.3	59.4	48.7

Table 6. Table VI: Quantitative evaluation for semantic segmentation on our dataset.

Method	Pixel Accuracy	Mean Accuracy	IoU
FCN [35]	47.07	33.76	24.63
Chen et al.[5]	66.28	67.98	53.90
Li et al.[27]	61.97	46.93	40.46
Zhao et al.[57]	74.82	72.36	60.91
Ours	75.54	74.49	63.98

Table 7. Table VII: Quantitative evaluation for multi-view reconstruction.

Method	Accuracy (lower is better)	Completeness (higher is better)
COLMAP [46]	3.74	2.33%
PMVS2 [14]	3.71	1.83%
OpenMVS [41]	3.68	1.25%
DeepMVS [19]	21.49	12.47%
Ours	17.72	31.55%

Equations

$$\tilde{k}=\underset{v\in\{1,2,\dots,n_{j}\}}{\arg\min}\left(\|p_{k}(x,y,z)-p^{\prime}_{v}(x,y,z)\|^{2}+w_{1}\|p_{k}(r,g,b)-p^{\prime}_{v}(r,g,b)\|^{2}+w_{2}\|p_{k}(s)-p^{\prime}_{v}(s)\|^{2}\right),$$

$$(R_{i},t_{i})=\underset{R,t}{\arg\min}\;\frac{1}{2}\sum_{(p_{k},p^{\prime}_{\tilde{k}})\in\mathcal{C}_{i,j}}\|p^{\prime}_{\tilde{k}}-Rp_{k}-t\|_{2}^{2}.$$


Taxonomy

Topics: Advanced Vision and Imaging · Robotics and Sensor-Based Localization · Optical measurement and interference techniques

Full text

Learning to Reconstruct and Understand Indoor Scenes from Sparse Views

Jingyu Yang, Ji Xu, Kun Li, Yu-Kun Lai,

Huanjing Yue, Jianzhi Lu, Hao Wu, and Yebin Liu

Jingyu Yang, Ji Xu, Hao Wu, and Huanjing Yue are with the School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China. Kun Li is with the Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China. Yu-Kun Lai is with the School of Computer Science and Informatics, Cardiff University, Wales, UK. Jianzhi Lu is with the 3vjia company. Yebin Liu is with the Department of Automation, Tsinghua University, Beijing 10084, China. Corresponding author: Kun Li (Email: [email protected])

Abstract

This paper proposes a new method for simultaneous 3D reconstruction and semantic segmentation of indoor scenes. Unlike existing methods that require recording a video using a color camera and/or a depth camera, our method only needs a small number of (e.g., 3-5) color images from uncalibrated sparse views as input, which greatly simplifies data acquisition and extends applicable scenarios. Since different views have limited overlaps, our method allows a single image as input to discern the depth and semantic information of the scene. The key issue is how to recover relatively accurate depth from single images and reconstruct a 3D scene by fusing very few depth maps. To address this problem, we first design an iterative deep architecture, IterNet, that estimates depth and semantic segmentation alternately, so that they benefit each other. To deal with the little overlap and non-rigid transformation between views, we further propose a joint global and local registration method to reconstruct a 3D scene with semantic information from sparse views. We also make available a new indoor synthetic dataset simultaneously providing photorealistic high-resolution RGB images, accurate depth maps and pixel-level semantic labels for thousands of complex layouts, useful for training and evaluation. Experimental results on public datasets and our dataset demonstrate that our method achieves more accurate depth estimation, smaller semantic segmentation errors and better 3D reconstruction results, compared with state-of-the-art methods.

Index Terms:

3D reconstruction, semantic segmentation, indoor scenes, sparse view

I Introduction

With the increasing demand for indoor navigation, home/office design, and augmented reality, indoor 3D reconstruction and understanding have become active topics in computer vision and graphics. Existing reconstruction methods can be broadly categorized into two groups. The first group scans indoor scenes with an integrated depth camera based on either time-of-flight (ToF) or structured-light sensing, which offers dense depth measurements. The pioneering KinectFusion [40] presents a detailed workflow using Kinect for indoor reconstruction. It was more recently extended by ElasticFusion [52] and BundleFusion [8], which achieve state-of-the-art results in real-time 3D reconstruction. Although depth is relatively simple to acquire this way, the captured depth contains considerable noise and missing data, and is limited to a small range of distances. Color cameras do not suffer from these issues, are far more widely available (e.g., on mobile phones), and have a smaller form factor than depth cameras. It is therefore interesting to study 3D scene reconstruction using a color camera, the second group of methods, which is challenging due to the lack of depth information. Simultaneous localization and mapping (SLAM) [34] and structure from motion (SFM) [6] are two popular approaches to feature-based point cloud reconstruction, on-line and off-line, respectively. However, these feature-based methods require rich textures in the scene and therefore struggle to produce dense point clouds. All the above methods require consecutive frame tracking or dense view capturing.

In this paper, we propose a new indoor-scene 3D reconstruction and semantic segmentation method using color images captured from several uncalibrated sparse views. The first challenge is dense reconstruction from sparse views with little overlap, which in practice degenerates into monocular depth estimation. The second challenge, as a consequence, is the non-rigid transformation between views introduced by the inaccurate depth estimated from single color images. To address these problems, we design IterNet, an iteratively optimized deep framework for simultaneous depth map recovery and semantic segmentation for each view, where the two tasks help improve each other. To estimate non-rigid transformations between sparse views, we further develop a joint global and local alignment method that fuses the estimated depths with the help of semantic information, integrating geometric, photometric and semantic information in a coarse-to-fine manner.

Depth recovery and semantic segmentation from images are ill-posed, so it is essential to learn from high-quality training data. For indoor scene understanding, a number of datasets have been made publicly available. Real-world datasets, such as NYUv2 [39], SUN RGB-D [48] and ScanNet [7], require substantial manual labor for annotation and contain unavoidable noise in the depths taken as ground truth, while synthetic datasets [49, 17] struggle to provide photorealistic RGB images and usually have limited layouts and image resolution. To the best of our knowledge, no existing dataset provides photorealistic RGB images, accurate depth maps, pixel-level semantic labels, and thousands of complex layouts at the same time. To address this, we build the IterNet RGB-D dataset with these features.

Experimental results on both public datasets and our dataset demonstrate that our method outperforms state-of-the-art methods on depth estimation, semantic segmentation, and multi-view reconstruction. Figure 1 gives an example of our IterNet RGB-D dataset and the reconstructed 3D model with estimated semantics using our IterNet. We will make the code and the dataset available online for research purposes.

In summary, our work is an integrated one that includes 1) an unprecedented indoor synthetic dataset simultaneously providing photorealistic high-resolution RGB images, accurate depth maps and pixel-level semantic labels for thousands of complex layouts, 2) a depth estimation method from a single color image, 3) a semantic segmentation method from a single view, and 4) a multi-view reconstruction method for sparse views. Each component of our method is novel and is validated by experiments on public datasets and our dataset. Together they solve the challenging problem of 3D reconstruction and understanding from sparse views. Our main contributions are:

• We provide the IterNet RGB-D dataset, which includes photorealistic high-resolution RGB images, accurate depth maps, and pixel-level semantic labels for thousands of complex layouts, useful for training and evaluation.

• We solve a challenging problem, namely reconstructing and understanding indoor 3D scenes using only color images captured from several uncalibrated sparse views. It is applicable to more scenarios than previous approaches that rely on texture and/or geometries of dense views, e.g., reconstructing and understanding a room using several photos captured by different users.

• We design a novel iterative joint optimization method for depth estimation and semantic segmentation for a given input color image, where the two tasks help improve each other. This architecture is not restricted to the tasks we address here and can also be extended to other related tasks such as object/part parsing.

• We propose a joint global and local registration method to fuse different sparse perspectives. This coarse-to-fine alignment is robust to the sparsity of views and the errors of monocular depth estimation.

II Related work

Indoor datasets. Naseer et al.[38] gave a comprehensive overview of indoor scene understanding in 2.5/3D. The first dataset is NYU-Depth with two versions introduced by Silberman et al.[39] using Microsoft Kinect. SUN RGB-D dataset [48] captured by four different RGB-D sensors contains 10,335 indoor images with dense annotations. Armeni et al.[2] provided Building Parser dataset with instance level semantic and geometric annotations. Matterport3D [4] contains 10,800 panoramic images covering $360^{\circ}$ viewpoints captured by a Matterport camera. ScanNet [7] is a 3D reconstruction dataset with 2.5 million frames obtained from 1,513 scans. These real-world datasets usually have some noise and missing areas in depth maps and need a lot of manual effort to annotate the labels. Hence, synthetic datasets are proposed for easy generation and accurate ground-truth. SUNCG [49] is a densely annotated large-scale indoor dataset, but the rendered RGB images are not photorealistic and RGB-D videos are not available. SceneNet RGB-D [17] provides pixel-level annotations and photorealistic RGB images, but the number of layouts is limited. Table I compares various publicly available 2.5/3D indoor datasets with our IterNet RGB-D dataset. Our dataset provides a total of 12,856 photorealistic images for thousands of layouts, and has a higher image resolution: $1280\times 960$ and $1280\times 720$ , covering more indoor scenes. Moreover, our dataset provides absolute depth maps and pixel-level semantic segmentation that are more precise and accurate. Compared with other datasets, the indoor scenes covered by our dataset are more general and more complex.

Monocular Depth Estimation. In computer vision, monocular depth estimation has been a long-standing topic in the last decades. Previous approaches mainly focused on hand-crafted features [18], defocused features[30], statistical priors[20] or graphical models [32]. With the development of deep learning, more recent approaches are based on Convolutional Neural Networks (CNNs). For instance, Eigen et al. [13] proposed a multi-scale CNN for depth estimation and demonstrated the effectiveness of the CNN-based method with promising results. Considering the correlation between tasks, Wang et al.[51] introduced a CNN for joint depth estimation and semantic segmentation. Xu et al.[53] proposed a multi-task approach for depth estimation via cross-modal interactions to refine the task. Recently, the attention mechanism has become popular, and Xu et al.[55] proposed a structured attention mechanism to fuse the features of different scales. The most similar work to ours is [54] where a continuous Conditional Random Field (CRF) is used to combine multi-scale features. Our approach develops from a similar intuition but further integrates semantic information in an iterative way.

Semantic Segmentation. Semantic segmentation is an extension of image classification. Instead of classifying an image as a whole, semantic segmentation assigns per-pixel predictions of object categories for the given image. It is challenging due to randomness of object distribution, poor illumination, and occlusion. Deng et al. [9] proposed a robust information theoretic (RIT) model to reduce the uncertainties, i.e., missing and noisy labels, by learning a transformation function and a discriminative classifier that maximize the mutual information of data and their labels in the latent space. Alternative approaches are typically based on CNNs. Long et al. [35] proposed the Fully Convolutional Network (FCN), a popular CNN architecture for dense predictions without any fully connected layers. Almost all the subsequent approaches to semantic segmentation adopted this paradigm. With the development of depth sensors and the release of RGB-D datasets, some methods attempted to use depth information for better segmentation, no longer limited to a single RGB image. Li et al. [27] constructed HHA images [16] for the depth channel through geometric encoding before feeding them to the network and used Long Short-Term Memory (LSTM) to fuse two different features. Ma et al. [36] predicted semantic segmentation from RGB-D sequences, but their method is inapplicable to sparse views. Our method exploits depth information to help improve semantic segmentation, but the depth is estimated from the input color image instead of directly captured by a dedicated depth sensor. We propose an iterative method for joint estimation of the depth and semantic segmentation, which benefit each other.

Indoor Scene 3D Reconstruction. Indoor Scene 3D Reconstruction from a color video or multi-view color images is a challenging and active topic. Given a color video, most structure from motion (SFM) methods [47] recovered the 3D structure by estimating the motion of the cameras corresponding to the frames. However, it is difficult for these methods to obtain dense and accurate reconstruction. Given multi-view color images with calibrated camera parameters, multi-view stereo (MVS) methods [33] can achieve more accurate 3D reconstruction. But they require adjacent views to have sufficient overlap and cannot work well with sparse views. COLMAP [45, 46] provides a pipeline containing both SFM and MVS with graphical and command-line interfaces. When the views of images are very sparse, the depth of each image can be estimated and fused together using iterative closest point (ICP) like registration methods [15]. However, it is difficult to achieve accurate depth estimation from individual color images which increases the difficulties of ICP fusion. Saxena et al.[43] proposed a novel method for 3D reconstruction from sparse views, but it only worked well for building-like outdoor scenes and cannot generate semantics. Learning-based methods, e.g., MVSNet [56] and DeepMVS [19], output the depth of a specific frame based on a color multi-view sequence, but they cannot deal with sparse views. In this paper, we design IterNet to estimate a more accurate depth map with the help of semantic segmentation, and propose a joint global and local registration method to better achieve indoor scene 3D reconstruction from sparse views.

III Proposed Method

In this section, we first introduce our IterNet RGB-D dataset in Section III-A, and then describe the technical details of IterNet for iterative joint depth estimation and semantic segmentation in Section III-B. The joint global and local multi-view reconstruction method is presented in Section III-C. Figure 2 illustrates the workflow of our method.

III-A Dataset

Different from the production of other synthetic datasets [17, 49], our dataset is generated by a third-party platform which includes various real-life house styles, real prototype rooms designed by professional designers, and detailed model materials. We also implement high-quality photorealistic rendering. Unlike traditional rendering, we split each image and recombine the parts to achieve distributed rendering, using the computing power of multiple CPU servers to multiply the rendering speed. Our rendering runs on a cluster of 32 servers, each consisting of a CPU with 32 cores and 64 threads. The average rendering time of a $1280\times 960$ image is about 90 seconds, so rendering all 12,856 images takes about 321 hours. In terms of rendering quality, in addition to the direct illumination of the light sources in the scene, the illumination reflected by other objects, known as Global Illumination (GI), is also taken into account. There are many ways to achieve GI. For better results, we adopt the Brute Force (BF) algorithm [50] based on path tracing. The number of samples per pixel is up to 512 and varies across scenes. The noise level is kept below 0.05. A lower noise level yields better rendering quality but requires longer rendering time. To obtain good results while limiting the rendering time, rendered images are denoised using a wavelet-based denoising method [11]. Figure 3 shows some examples of different scenarios in our dataset. Our dataset provides photorealistic high-resolution RGB images, accurate depth maps and pixel-level semantic labels for thousands of layouts, useful for training and evaluation. Figure 4 shows more scenarios in our dataset. It can be seen that our dataset contains more complex indoor layouts, richer textures, colorful and realistic lighting, and higher-resolution images, which are more photorealistic and closer to real-world images than those of existing synthetic datasets. Our dataset will be available online.
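The quoted rendering figures can be cross-checked with a short calculation. This is only a sketch, assuming the 90-second average applies to every one of the 12,856 images and that images are rendered one after another on the cluster; the variable names are illustrative.

```python
# Assumption: ~90 s per image on the cluster, images rendered sequentially.
num_images = 12_856
seconds_per_image = 90

total_hours = num_images * seconds_per_image / 3600
print(f"{total_hours:.1f} hours")  # prints "321.4 hours"
```

Under these assumptions the total agrees with the reported figure of about 321 hours.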

III-B IterNet: Iterative CNN for Joint Depth Estimation and Semantic Segmentation

Network Architecture. The proposed IterNet is a multi-task deep CNN mainly consisting of two parts: the depth estimation sub-network and the semantic segmentation sub-network, as shown in Figure 5.

In the design of the depth estimation sub-network, we refer to a monocular depth estimation method [54] using a continuous conditional random field (CCRF) to combine multi-scale features. Different from [54], we add a semantic branch built upon an encoder-decoder structure to extract semantic features and further use a CCRF to integrate the multi-scale RGB features and the semantic features which can better make use of boundary constraints in semantic segmentation. The RGB branch consists of a front-end base network and a refinement network combined with several CCRF modules. Together with semantic information, the output of the RGB branch is fed into a CCRF module to generate the estimation of depth which is used as the input of the semantic segmentation sub-network.

In the semantic segmentation sub-network, we use the Long Short-Term Memorized Context Fusion (LSTM-CF) model [27] with a different fusion scheme for the RGB-D features; the model is capable of fusing contextual information from multiple sources (i.e., photometric and depth channels). Instead of the original serial vertical and horizontal context layers, we adopt a parallel context layer and a direct fusion scheme to make better use of depth. We also add Atrous Spatial Pyramid Pooling (ASPP) [5] as a multi-scale feature extractor. Unlike an encoder-decoder network, which taps different intermediate layers to obtain multi-scale features, ASPP employs multiple parallel filters with different sampling rates. For depth information, rather than directly feeding a depth image into the network, we first encode it into an HHA image [16] using geocentric encoding and then input it into the network.
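To make the idea of "parallel filters with different sampling rates" concrete, the following NumPy sketch applies one kernel at several atrous (dilation) rates and stacks the responses. It is a toy illustration only: the function names, the single-channel input, the shared kernel and the rates (1, 2, 4) are assumptions, not the configuration used in IterNet.

```python
import numpy as np

def atrous_conv2d(image, kernel, rate):
    """Same-size 2D atrous (dilated) correlation with zero padding.

    Kernel taps are spaced `rate` pixels apart, so a 3x3 kernel covers a
    (2 * rate + 1)-pixel window without adding parameters.
    Assumes an odd-sized kernel.
    """
    kh, kw = kernel.shape
    pad_h, pad_w = rate * (kh // 2), rate * (kw // 2)
    padded = np.pad(image, ((pad_h, pad_h), (pad_w, pad_w)))
    h, w = image.shape
    out = np.zeros((h, w), dtype=float)
    for i in range(kh):
        for j in range(kw):
            # Shifted view of the padded image for kernel tap (i, j).
            out += kernel[i, j] * padded[i * rate:i * rate + h,
                                         j * rate:j * rate + w]
    return out

def aspp(image, kernel, rates=(1, 2, 4)):
    """Run the same kernel at several rates in parallel and stack the results."""
    return np.stack([atrous_conv2d(image, kernel, r) for r in rates], axis=0)

# Usage: a single bright pixel spreads to taps that are `rate` pixels apart.
img = np.zeros((9, 9))
img[4, 4] = 1.0
features = aspp(img, np.ones((3, 3)))
print(features.shape)  # (3, 9, 9)
```

In a real ASPP module each branch has its own learned multi-channel filters; sharing one kernel here just keeps the effect of the sampling rate easy to see.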

Training and Testing. Given datasets of RGB-Depth-Semantic triplets, our aim is to train the designed network for joint depth and semantic estimation. The depth estimation sub-network and semantic segmentation sub-network are designed to interact with each other to boost the performance. Instead of jointly training the two sub-networks, we train the depth estimation and semantic segmentation sub-networks sequentially for flexible boosting. Taking the depth estimation sub-network as an example, we train the upper branch and the lower branch with RGB-Depth pairs and Semantic-Depth pairs, respectively. The depth estimation sub-network is then fine-tuned with the RGB-Depth-Semantic triplets. The semantic segmentation sub-network is trained in a similar way.

At the test stage, since each sub-network expects the output of the other sub-network as part of input, we use the following strategy. We need an initialized semantic segmentation or depth estimation which can be easily obtained by disabling one of the branches in the original network structure. For example, if we want to obtain an initial depth estimation for semantic segmentation, we disable the semantic segmentation branch in the depth estimation sub-network and then extract features from RGB branches as an initial depth. We then alternately run the two sub-networks, with the output of one sub-network used as input for the other sub-network. The additional depth information helps improve semantic segmentation, and the semantic segmentation in turn contributes to improved depth estimation. In practice, we find that there is no significant improvement after 3 iterations, which shows quick convergence.
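The alternating test-time procedure can be sketched as a small loop. The three callables below are hypothetical stand-ins for the trained sub-networks (they are not part of any released API), and the sketch assumes the depth-first initialization described in the example above.

```python
def run_iternet_inference(rgb, initial_depth_fn, semantic_fn, depth_fn,
                          num_iterations=3):
    """Alternate the two sub-networks, feeding each the other's latest output.

    initial_depth_fn(rgb)        -> depth from the RGB branch only (hypothetical)
    semantic_fn(rgb, depth)      -> semantic segmentation (hypothetical)
    depth_fn(rgb, semantics)     -> refined depth (hypothetical)
    """
    depth = initial_depth_fn(rgb)
    semantics = None
    for _ in range(num_iterations):
        semantics = semantic_fn(rgb, depth)   # depth helps segmentation
        depth = depth_fn(rgb, semantics)      # segmentation helps depth
    return depth, semantics

# Toy usage with numeric stand-ins, only to show the data flow.
depth, semantics = run_iternet_inference(
    rgb=None,
    initial_depth_fn=lambda rgb: 0,
    semantic_fn=lambda rgb, d: d + 1,
    depth_fn=lambda rgb, s: s * 2,
)
print(depth, semantics)  # 14 7
```

The default of three iterations mirrors the observation that later iterations bring no significant improvement.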

Implementation Details. The proposed approach is implemented in the Caffe framework [22] and runs on a computer with an Nvidia GTX 1080 Ti graphics card (11 GB). For the depth estimation sub-network, the learning rate is initialized at $10^{-11}$ and decreases by $10\%$ every 30 epochs. The batch size is set to 16. The momentum and the weight decay are set to 0.9 and 0.0005, respectively. The semantic segmentation sub-network follows the same training rules, but the initial learning rate is set to $10^{-4}$. The batch size, momentum and weight decay are set to 8, 0.9 and 0.005, respectively. The learning rate decreases by $10\%$ every 20 epochs. When the pretraining of each branch is finished, we fine-tune the sub-networks, with initial learning rates of $10^{-12}$ and $10^{-5}$ for depth estimation and semantic segmentation, respectively. The batch size, momentum and weight decay remain the same as in pretraining.
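The learning-rate schedules above can be written as a step decay. The sketch below assumes that "decreases by 10%" means the rate is multiplied by 0.9 at each step boundary; the function name and signature are illustrative and not taken from the authors' Caffe configuration.

```python
def step_decay_lr(initial_lr, epoch, step_epochs, decay=0.10):
    """Learning rate after `epoch` epochs with a (1 - decay) drop every `step_epochs`.

    Assumption: a 10% decrease is interpreted as multiplying by 0.9.
    """
    return initial_lr * (1.0 - decay) ** (epoch // step_epochs)

# Depth sub-network: starts at 1e-11, steps every 30 epochs.
depth_lr_epoch_60 = step_decay_lr(1e-11, epoch=60, step_epochs=30)
# Semantic sub-network: starts at 1e-4, steps every 20 epochs.
semantic_lr_epoch_40 = step_decay_lr(1e-4, epoch=40, step_epochs=20)
```

After two step boundaries, each rate is 0.81 times its initial value under this interpretation.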

III-C Joint Global and Local Reconstruction

After obtaining the depth and the semantic segmentation for the image of each view, we reconstruct the whole 3D scene by fusing the depths of different views. The straightforward way is to use the ICP algorithm to align the point clouds transformed from the depths of different perspectives. However, it is difficult to achieve satisfactory alignment. First, the depths are obtained by a monocular depth estimation network, not captured by Kinect or other depth cameras, containing some non-statistical errors. It is therefore insufficient to align two depth point clouds with just one rigid transformation. Second, for sparse perspectives, the overlap between two adjacent views is limited which is difficult to handle by standard ICP algorithms. Hence, we propose a new joint global and local registration method by exploiting photometric and semantic information to improve reconstruction quality.

Before fusion, we filter the messy points based on the plane constraint similar to [3]. Let $\mathcal{X}\triangleq\{X_{i}\}=\{(C_{i},D_{i},S_{i})\}_{i=1}^{N}$ be the sparse view set, where $N$ is the total number of views for reconstruction. After depth estimation and semantic segmentation, each view now contains three components: color $C_{i}$ , depth $D_{i}$ and segmentation $S_{i}$ . We align all the depth point clouds in sequence with the previous registration result used as the next target model. Each alignment has two stages, namely global alignment and local alignment.

Global alignment. Taking the point cloud generated using the previous $i-1$ views as the target, our goal for global alignment is to find an optimal global rigid transformation $\mathcal{T}_{i}$ for view $i$ , which is composed of two parts: rotation $R_{i}$ and translation $t_{i}$ . Specifically, we first convert the depth map $D_{i}$ into a point cloud $\mathcal{P}_{i}=\{p_{1},p_{2},...,p_{n_{i}}\},i\in\{1,2,\dots,N\}$ . $\mathcal{P}_{i}$ is a point set for the i-th view, and $n_{i}$ represents the total number of points in the view. We take a global ICP-type framework alternating two steps, until convergence. The transformation is initialized by a $4\times 4$ identity matrix. Assuming the target point cloud is $\mathcal{P}_{j}$ containing all the fused points from the previous views, the first step finds for each point $p_{k}\in\mathcal{P}_{i}$ its corresponding point $p^{\prime}_{\tilde{k}}\in\mathcal{P}_{j}$ if possible, and the second step updates the transformation $\mathcal{T}_{i}$ such that when applied to $\mathcal{P}_{i}$ the point cloud is aligned with $\mathcal{P}_{j}$ .
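The conversion of a depth map $D_{i}$ into a point cloud $\mathcal{P}_{i}$ can be illustrated with a standard pinhole back-projection. The text does not state how this step is carried out, and the input views are uncalibrated, so the intrinsics `fx`, `fy`, `cx`, `cy` below are assumed values and the function is a hypothetical sketch.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map to an (n, 3) array of 3D points.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy), all in pixels. Pixels with non-positive depth are dropped.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column, v: row
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]

# Usage with a tiny constant-depth map and made-up intrinsics.
cloud = depth_to_point_cloud(np.full((2, 2), 2.0), fx=1.0, fy=1.0, cx=0.0, cy=0.0)
print(cloud.shape)  # (4, 3)
```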

In the first step, we exploit the additional photometric and semantic information. We lift each point $p_{k}\in\mathcal{P}_{i}$ from 3D to a point in a 7-dimensional (7D) space, $\hat{p}_{k}=(x_{k},y_{k},z_{k},r_{k},g_{k},b_{k},s_{k})$ , including its 3D position $(x_{k},y_{k},z_{k})$ , RGB color $(r_{k},g_{k},b_{k})$ and semantic label $s_{k}$ . Similarly, the point $p^{\prime}_{\tilde{k}}\in\mathcal{P}_{j}$ is lifted to a 7D point $\hat{p}^{\prime}_{\tilde{k}}$ . Our global registration method for aligning $\mathcal{P}_{i}$ and $\mathcal{P}_{j}$ first finds the corresponding point $p^{\prime}_{\tilde{k}}\in\mathcal{P}_{j}$ for each point $p_{k}$ in $\mathcal{P}_{i}$ by the following optimization:

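$$\tilde{k}=\underset{v\in\{1,2,\dots,n_{j}\}}{\arg\min}\left(\|p_{k}(x,y,z)-p^{\prime}_{v}(x,y,z)\|^{2}+w_{1}\|p_{k}(r,g,b)-p^{\prime}_{v}(r,g,b)\|^{2}+w_{2}\|p_{k}(s)-p^{\prime}_{v}(s)\|^{2}\right),\tag{1}$$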

where $w_{1}$ and $w_{2}$ are weights to balance the importance of geometric, photometric and semantic information. They are set to be $w_{1}=0.1$ and $w_{2}=10$ in our experiments.
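A brute-force version of this 7D correspondence search might look as follows. It is a minimal sketch: the array layout (columns x, y, z, r, g, b, s) and the function name are assumptions, the semantic label is treated as a plain number as the cost is written, and the weights default to the values stated above.

```python
import numpy as np

def find_correspondences_7d(source, target, w1=0.1, w2=10.0):
    """For each source point, return the index of the lowest-cost target point.

    source: (n_i, 7) array, target: (n_j, 7) array, columns x, y, z, r, g, b, s.
    The cost is squared geometric distance + w1 * squared color distance
    + w2 * squared label difference. Memory use is O(n_i * n_j), so this is
    only suitable for small point sets.
    """
    diff = source[:, None, :] - target[None, :, :]        # (n_i, n_j, 7)
    geometric = np.sum(diff[..., 0:3] ** 2, axis=-1)
    photometric = np.sum(diff[..., 3:6] ** 2, axis=-1)
    semantic = diff[..., 6] ** 2
    cost = geometric + w1 * photometric + w2 * semantic
    best = np.argmin(cost, axis=1)
    return best, cost[np.arange(len(source)), best]

# Usage: the geometrically nearest target has a different label, so the
# semantic term steers the match to the second target instead.
src = np.array([[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]])
tgt = np.array([[0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0],
                [0.10, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]])
best, best_cost = find_correspondences_7d(src, tgt)
print(best)  # [1]
```

For realistic point counts, a spatial index such as a k-d tree over suitably scaled 7D coordinates would replace the dense cost matrix.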

Due to limited overlap, not all the points in $\mathcal{P}_{i}$ have corresponding points in $\mathcal{P}_{j}$. We reject $p^{\prime}_{\tilde{k}}$ if the matching error is larger than a threshold. In our implementation, this threshold is set to 5 cm, and correspondences with larger distances are ignored. Let $\mathcal{C}_{i,j}=\{(p_{k},p^{\prime}_{\tilde{k}})\}$ be the set of retained correspondences. In the second step, since photometric and semantic matching errors are independent of rigid transformations, we use a standard ICP algorithm [15] to find the transformation between the two point clouds:

[TABLE]
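As a hedged sketch of the second step, the code below rejects distant matches and then computes the closed-form least-squares rotation and translation for the retained pairs using the SVD-based (Kabsch) solution, which is the usual update inside one point-to-point ICP iteration. Whether [15] is implemented exactly this way is not stated in the text. Applying the 5 cm threshold to the 3D distance only, the use of NumPy, and the function names are assumptions.

```python
import numpy as np


def reject_far_matches(src, dst, max_dist=0.05):
    """Keep only matched pairs whose 3D distance is at most max_dist (metres)."""
    keep = np.linalg.norm(src - dst, axis=1) <= max_dist
    return src[keep], dst[keep]


def rigid_transform(src, dst):
    """Least-squares R, t such that R @ src[k] + t approximates dst[k]."""
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_mean).T @ (dst - dst_mean)   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t


# Synthetic check: recover a known rotation about the z-axis and a translation.
rng = np.random.default_rng(0)
src = rng.random((20, 3))
angle = np.pi / 6
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.1, -0.2, 0.3])
dst = src @ R_true.T + t_true
R_est, t_est = rigid_transform(src, dst)
```

In a full ICP loop, the correspondence search and this update would alternate until the transformation stops changing.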

Local alignment. Using the 7D global registration method, we achieve a coarse alignment that broadly aligns different views, but it still cannot cope with the problem of non-statistical errors in monocular depth estimation, as such local deformation is no longer rigid. To address this problem, we further propose a local registration strategy to refine the previous coarse estimation, similar to coarse-to-fine refinement. Specifically, we first extract local point sets from the original point cloud according to their semantic labels, and then register each of them using the above method. Note that in this case, a subset of points from one view is only matched to subsets of points with the same semantic label. Therefore, when finding the matched point, the semantic difference term in Eq. (1) is always zero. For each local set, once it is aligned, we fuse the registered parts from different views by averaging the 3D positions of overlapping points to mitigate the influence of noise. The key to our joint global and local registration method is to use multiple transformations to register sparse views with coarse-to-fine refinement, rather than a single transformation, which makes it more robust to the noise and outliers in monocular depth estimation.
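The bookkeeping behind the local step can be sketched as follows. This is an illustration only: it assumes 7D points stored as tuples with the semantic label in the last position, and the helper names `split_by_label` and `fuse_pair` are hypothetical. The per-label registration itself would reuse the global procedure described above.

```python
from collections import defaultdict


def split_by_label(points):
    """Group 7D points (x, y, z, r, g, b, s) by their semantic label s."""
    groups = defaultdict(list)
    for p in points:
        groups[p[6]].append(p)
    return dict(groups)


def fuse_pair(p, q):
    """Average the 3D positions of two matched points from different views."""
    return tuple((a + b) / 2.0 for a, b in zip(p[:3], q[:3]))


cloud = [
    (0.0, 0.0, 0.0, 0.2, 0.2, 0.2, 3),
    (1.0, 0.0, 0.0, 0.8, 0.8, 0.8, 5),
    (2.0, 4.0, 6.0, 0.2, 0.2, 0.2, 3),
]
local_sets = split_by_label(cloud)          # one point subset per label
fused = fuse_pair(cloud[0], cloud[2])       # averaged position of a matched pair
```

Each subset would then receive its own rigid transformation, which is how several transformations, rather than one, end up describing a single view.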

IV Experimental Results

IV-A Ablation Study

We compare the full model with two variants, one without semantic segmentation and one without depth estimation, in Table II. The full model achieves the best performance. Figure 6 shows the fusion results of an ICP matching method [15], 4PCS [1], global alignment using the estimated depth without the help of the semantic branch, and our proposed joint global and local registration method. Standard ICP methods produce misalignments in some local areas. In contrast, our method achieves better fusion results in terms of both global structure and local details.

Our iterative scheme in IterNet usually converges to promising results after three iterations and is stable for various images. Figure 7 shows how the average RMS (root mean squared) error of depth estimation decreases and how the average pixel accuracy of semantic segmentation increases over the iterations, both computed over all the test images. Neither depth estimation nor semantic segmentation improves significantly beyond three iterations.

To study and verify the role of IterNet in depth estimation, we compare two recent backbone architectures, Structured Attention Guided Convolutional Neural Fields [55] and CCRF [54], both of which achieve promising performance in depth estimation. Figure 8 shows the comparison results on our IterNet RGB-D dataset. We crop high-resolution images into patches of 426 $\times$ 426 and feed them into the networks. It can be seen that our framework significantly enhances the attention-based architecture with clearer object structures, and refines the CCRF architecture with sharper contours for some objects such as the pillow and the chair.

IV-B Depth Estimation

We compare our approach with several state-of-the-art methods on the NYUv2 dataset [39] in Table III. Following other methods, we use 795 images for training and the other 654 images for testing. We also use the same raw data as other methods and adopt data augmentation (yielding 4770 training images) to avoid over-fitting. Following previous work [12, 13, 51], we evaluate the depth estimation results with the following metrics: (1) mean relative error (rel): $\frac{1}{P}\sum_{i}\frac{\left|d_{i}-d_{i}^{*}\right|}{d_{i}^{*}}$ ; (2) root mean squared error (rms): $\sqrt{\frac{1}{P}\sum_{i}(d_{i}-d_{i}^{*})^{2}}$ ; (3) mean log10 error (log10): $\frac{1}{P}\sum_{i}\left|\log_{10}(d_{i})-\log_{10}(d_{i}^{*})\right|$ ; and (4) accuracy with threshold $t$ : percentage ( $\%$ ) of pixels for which $\delta=\max(\frac{d_{i}^{*}}{d_{i}},\frac{d_{i}}{d_{i}^{*}})<t$ , where $d_{i}$ and $d_{i}^{*}$ denote the predicted depth value and the ground-truth value for pixel $i$ , and $P$ is the total number of pixels. The results of the compared methods are quoted from their papers. Our method outperforms thirteen competing methods in all metrics, and is comparable to PAD-Net [53], which has a more complex network structure and requires ground-truth contours and normals as part of its labels. We ran multiple training trials and consistently obtained these results. We also quantitatively evaluate some methods with their provided code on our IterNet RGB-D dataset. As shown in Table IV, our method achieves the most accurate depth estimation on all the metrics. Figure 9 gives some visual comparison results on the NYUv2 dataset [39] and our dataset. Figure 10 gives more qualitative comparison results with enlarged local areas on the NYUv2 dataset [39] and our dataset. It can be seen that our method achieves more accurate depth estimation, consistent with the quantitative evaluation. Although [54] also produces good visual results owing to its promising estimation of relative depths between objects, our method achieves more accurate results both visually and quantitatively.
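The four depth metrics defined above translate directly into code. The following is a minimal sketch over flat lists of positive depth values. The thresholds $1.25$ , $1.25^{2}$ and $1.25^{3}$ are the values commonly used in the depth estimation literature and are an assumption here, since the text only refers to a threshold $t$ . The function name `depth_metrics` is hypothetical.

```python
from math import log10, sqrt


def depth_metrics(pred, gt, thresholds=(1.25, 1.25 ** 2, 1.25 ** 3)):
    """Compute rel, rms, log10 and threshold accuracies (in %) for depth maps.

    pred and gt are equal-length sequences of positive depths, one per pixel.
    """
    pairs = list(zip(pred, gt))
    P = len(pairs)
    rel = sum(abs(d - g) / g for d, g in pairs) / P
    rms = sqrt(sum((d - g) ** 2 for d, g in pairs) / P)
    mean_log10 = sum(abs(log10(d) - log10(g)) for d, g in pairs) / P
    acc = [100.0 * sum(max(g / d, d / g) < t for d, g in pairs) / P
           for t in thresholds]
    return {"rel": rel, "rms": rms, "log10": mean_log10, "acc": acc}


# Two pixels: one predicted exactly, one predicted at half the true depth.
metrics = depth_metrics(pred=[1.0, 2.0], gt=[1.0, 4.0])
```

For this toy input the second pixel has $\delta=2$ , which exceeds all three assumed thresholds, so every accuracy value is 50%.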

To evaluate the generalizability of our model trained on our dataset, we show some depth estimation results for real indoor scenes from the NYUv2 dataset [39] and the SUN RGB-D dataset [48] without fine-tuning in Figure 11. It can be seen that our model trained on our dataset generalizes well to other datasets.

IV-C Semantic Segmentation

To evaluate the performance of semantic segmentation, we use the NYUv2-40 dataset [35], in which all objects in the NYUv2 dataset [39] are divided into 40 categories. We use the same training and testing data as other methods and adopt three metrics in percentage ( $\%$ ): pixel accuracy, mean accuracy, and Intersection over Union (IoU). As shown in Table V, our inferred semantic segmentation results outperform those of state-of-the-art methods. We also quantitatively evaluate some recent methods that provide source code on our IterNet RGB-D dataset in Table VI. It can be seen that our method also achieves the best performance. Figure 12 presents some visual comparison results on the NYUv2-40 dataset and our dataset mapped into 87 categories. More qualitative comparison results for semantic segmentation are depicted in Figure 13 and Figure 14. Consistent with the quantitative results in Table V and Table VI, our approach generates more accurate semantic segmentation on both the real dataset (NYUv2) and the synthetic dataset (IterNet RGB-D) than the four other competing methods.
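For reference, the three segmentation metrics can be computed as in the sketch below, which operates on flat lists of integer class labels. Averaging the per-class accuracy and IoU over the classes that appear in the ground truth is an assumption, as the text does not state the exact convention. The function name `segmentation_metrics` is hypothetical.

```python
def segmentation_metrics(pred, gt, num_classes):
    """Return (pixel accuracy, mean accuracy, mean IoU), all in percent."""
    correct = [0] * num_classes     # pixels of class c that are predicted as c
    gt_count = [0] * num_classes    # pixels whose ground-truth label is c
    pred_count = [0] * num_classes  # pixels whose predicted label is c
    for p, g in zip(pred, gt):
        gt_count[g] += 1
        pred_count[p] += 1
        if p == g:
            correct[g] += 1

    # Assumption: average only over classes present in the ground truth.
    present = [c for c in range(num_classes) if gt_count[c] > 0]
    pixel_acc = 100.0 * sum(correct) / len(gt)
    mean_acc = 100.0 * sum(correct[c] / gt_count[c] for c in present) / len(present)
    mean_iou = 100.0 * sum(
        correct[c] / (gt_count[c] + pred_count[c] - correct[c]) for c in present
    ) / len(present)
    return pixel_acc, mean_acc, mean_iou


# Four pixels, two classes, one pixel of class 1 mislabelled as class 0.
pixel_acc, mean_acc, mean_iou = segmentation_metrics(
    pred=[0, 0, 1, 1], gt=[0, 1, 1, 1], num_classes=2
)
```

Pixel accuracy rewards frequent classes, whereas mean accuracy and IoU weight every class equally, which is why the three values are usually reported together.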

IV-D Multi-view Reconstruction

In Figure 15, we evaluate the multi-view 3D reconstruction performance of the proposed method on the NYUv2 dataset [39] and our dataset using three wide-baseline views, compared with four state-of-the-art multi-view stereo methods: COLMAP [45, 46], PMVS2 [14], OpenMVS [41] and DeepMVS [19]. We obtain the sparse views for the NYUv2 dataset by selecting one frame out of every 30-40 frames, and use the camera parameters estimated by COLMAP [45] for OpenMVS [41], PMVS2 [14] and DeepMVS [19]. As shown in Figure 15, COLMAP [45, 46] fails to generate meaningful results on the NYUv2 dataset from sparse views. PMVS2 [14] and OpenMVS [41] produce obviously wrong points: in the side and top views on the NYUv2 dataset, some points cluster together. Moreover, their point clouds are too sparse to provide acceptable results by linear interpolation. DeepMVS reconstructs more points than the traditional methods, but the reconstructed model contains a lot of noise and outliers. In contrast, our method achieves the best results for sparse multi-view reconstruction by considering 7D information (geometry, photometry and semantics) and using joint global and local registration. More results on the NYUv2 dataset [39] and our dataset using three or four sparse views are given in Figure 16 and Figure 17, respectively. Again, the multi-view stereo method in COLMAP [46] fails to generate 3D point clouds, and the point clouds reconstructed by OpenMVS [41] and PMVS2 [14] lack sufficient density and completeness. Although DeepMVS [19] achieves dense reconstruction, the reconstructed model contains many wrong points. In contrast, our method achieves accurate and complete reconstruction from sparse views. Because COLMAP [46] fails for most scenes in the NYUv2 dataset [39], we give the quantitative evaluation on our dataset in Table VII. We use two indicators to evaluate the results of MVS reconstruction: accuracy and completeness. Accuracy is the average distance between the points on the reconstructed model and their nearest points on the ground-truth model. Completeness measures the percentage of points on the ground-truth model that can find corresponding points on the reconstructed model within a certain distance threshold (0.1). We generate the 3D ground-truth model by fusing multi-view ground-truth depth point clouds using ICP. As shown in Table VII, our method achieves the most complete reconstruction while maintaining accuracy. Although traditional multi-view stereo methods [46, 14, 41] have higher accuracy, their reconstructed points are too sparse to provide acceptable results by linear interpolation. Figure 18 shows our reconstructed models on the NYUv2 dataset [39] and our dataset, presented from five different views.
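The two reconstruction indicators can be written down compactly. The sketch below uses a brute-force nearest-neighbour search over small point lists, whereas a practical evaluation would use a spatial index such as a k-d tree. The function names are hypothetical, and the unit of the 0.1 threshold is not specified in the text, so it is left as a plain number.

```python
from math import dist


def nearest_distance(p, cloud):
    """Euclidean distance from point p to its nearest neighbour in cloud."""
    return min(dist(p, q) for q in cloud)


def accuracy(reconstructed, ground_truth):
    """Average distance from reconstructed points to the ground-truth model."""
    total = sum(nearest_distance(p, ground_truth) for p in reconstructed)
    return total / len(reconstructed)


def completeness(reconstructed, ground_truth, threshold=0.1):
    """Percentage of ground-truth points that have a reconstructed point
    within the distance threshold."""
    covered = sum(nearest_distance(g, reconstructed) <= threshold
                  for g in ground_truth)
    return 100.0 * covered / len(ground_truth)


# One reconstructed point close to the first of two ground-truth points.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reconstructed = [(0.05, 0.0, 0.0)]
acc = accuracy(reconstructed, ground_truth)
comp = completeness(reconstructed, ground_truth)
```

The toy example shows why both indicators are needed: a very sparse reconstruction can score well on accuracy (lower is better) while covering only half of the ground-truth model.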

V Conclusions

In this paper, we address a challenging problem: reconstructing and understanding indoor 3D scenes from several color images captured from uncalibrated sparse views. We propose IterNet, a novel iterative network that jointly estimates the depth map and semantic segmentation from a single color image, and a joint global and local registration method to reconstruct indoor 3D scenes from sparse views. We also introduce and make available the IterNet RGB-D dataset, a new dataset that simultaneously provides high-resolution photorealistic RGB images, accurate depth maps, and pixel-level semantic labels for thousands of layouts. Experimental results on both public datasets and our dataset demonstrate that our method achieves the best results on depth estimation, semantic segmentation and multi-view reconstruction, compared with state-of-the-art methods.

Bibliography

[1] D. Aiger, N. J. Mitra, and D. Cohen-Or. 4-points congruent sets for robust surface registration. ACM Trans. Graphics, 27(3):#85, 1–10, 2008.
[2] I. Armeni, S. Sax, A. R. Zamir, and S. Savarese. Joint 2D-3D-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105, 2017.
[3] A. Bódis-Szomorú, H. Riemenschneider, and L. Van Gool. Superpixel meshes for fast edge-preserving surface reconstruction. In CVPR, pages 2011–2020, 2015.
[4] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3D: Learning from RGB-D data in indoor environments. arXiv preprint arXiv:1709.06158, 2017.
[5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. PAMI, 40(4):834–848, 2018.
[6] H. Cui, S. Shen, W. Gao, and Z. Hu. Efficient large-scale structure from motion by fusing auxiliary imaging information. IEEE Trans. Image Processing, 24(11):3561–3573, 2015.
[7] A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR, volume 2, page 10, 2017.
[8] A. Dai, M. Nießner, M. Zollhöfer, S. Izadi, and C. Theobalt. BundleFusion: Real-time globally consistent 3D reconstruction using on-the-fly surface reintegration. ACM Trans. Graphics, 36(4):76a, 2017.