Silhouette Guided Point Cloud Reconstruction beyond Occlusion
Chuhang Zou, Derek Hoiem

TL;DR
This paper introduces a silhouette-guided method for 3D point cloud reconstruction from a single RGB image, effectively handling occlusions and improving shape prediction accuracy.
Contribution
It presents a novel silhouette completion and guidance approach that enhances 3D reconstruction robustness beyond occlusion, with additional improvements using multi-view reprojection and surface refinement.
Findings
Improved reconstruction accuracy on synthetic and real datasets.
Effective silhouette completion for occluded objects.
Enhanced 3D shape prediction with multi-view loss and refinement.
Abstract
One major challenge in 3D reconstruction is to infer the complete shape geometry from partial foreground occlusions. In this paper, we propose a method to reconstruct the complete 3D shape of an object from a single RGB image, with robustness to occlusion. Given the image and a silhouette of the visible region, our approach completes the silhouette of the occluded region and then generates a point cloud. We show improvements for reconstruction of non-occluded and partially occluded objects by providing the predicted complete silhouette as guidance. We also improve state-of-the-art for 3D shape prediction with a 2D reprojection loss from multiple synthetic views and a surface-based smoothing and refinement step. Experiments demonstrate the efficacy of our approach both quantitatively and qualitatively on synthetic and real scene datasets.
| Category | CD | EMD | ||||
|---|---|---|---|---|---|---|
| PSG | Pixel2Mesh | Ours | PSG | Pixel2Mesh | Ours | |
| plane | 0.430 | 0.477 | 0.386 | 0.396 | 0.579 | 0.527 |
| bench | 0.629 | 0.624 | 0.436 | 1.113 | 0.965 | 0.815 |
| cabinet | 0.439 | 0.381 | 0.373 | 2.986 | 2.563 | 2.147 |
| car | 0.333 | 0.268 | 0.308 | 1.747 | 1.297 | 1.306 |
| chair | 0.645 | 0.610 | 0.606 | 1.946 | 1.399 | 1.257 |
| monitor | 0.722 | 0.755 | 0.501 | 1.891 | 1.536 | 1.314 |
| lamp | 1.193 | 1.295 | 0.969 | 1.222 | 1.314 | 1.007 |
| speaker | 0.756 | 0.739 | 0.632 | 3.490 | 2.951 | 2.441 |
| firearm | 0.423 | 0.453 | 0.463 | 0.397 | 0.667 | 0.572 |
| couch | 0.549 | 0.490 | 0.439 | 2.207 | 1.642 | 1.536 |
| table | 0.517 | 0.498 | 0.589 | 2.121 | 1.480 | 1.340 |
| cellphone | 0.438 | 0.421 | 0.332 | 1.019 | 0.724 | 0.674 |
| watercraft | 0.633 | 0.670 | 0.478 | 0.945 | 0.814 | 0.730 |
| mean | 0.593 | 0.591 | 0.501 | 1.653 | 1.380 | 1.205 |
| Method | CD | EMD |
|---|---|---|
| Ours w/o surface refine & w/o 2D proj loss | 0.398 | 1.784 |
| Ours w/o surface refine | 0.389 | 1.660 |
| Ours w/o 2D proj loss | 0.502 | 1.220 |
| Ours Full | 0.501 | 1.205 |
| Method | CD | EMD |
|---|---|---|
| 3D-LMNet [27] | 5.40 | 7.00 |
| Ours | 5.54 | 5.93 |
| Method | Full | Visible | Occluded |
|---|---|---|---|
| SeGAN [6] | 76.4 | 63.9 | 27.6 |
| Ours (ResNet-18) | 82.8 | 82.9 | 33.9 |
| Ours (ResNet-50) | 84.3 | 83.4 | 36.2 |
| Method | Training data | Occluded | Non-occluded | ||||
| sofa | chair | table | sofa | chair | table | ||
| Mask-RCNN | real | 84.34 | 59.05 | 60.14 | 91.99 | 69.96 | 60.88 |
| Ours | syn | 87.58 | 59.61 | 58.12 | 92.02 | 69.88 | 64.94 |
| Ours | syn+real | 88.56 | 59.25 | 68.83 | 92.19 | 72.01 | 56.90 |
| Method | CD | EMD | ||||
| sofa | chair | table | sofa | chair | table | |
| Ours w/o seg | 15.54 | 17.96 | 24.35 | 16.63 | 16.51 | 22.56 |
| Ours w/ pred vis seg | 9.15 | 13.20 | 17.96 | 9.29 | 13.38 | 17.81 |
| Ours w/ pred full seg | 8.70 | 13.14 | 16.50 | 8.81 | 13.04 | 16.36 |
| Ours w/ gt full seg | 8.27 | 10.16 | 11.36 | 8.11 | 10.47 | 11.44 |
| Method | CD | EMD | ||||
| sofa | chair | table | sofa | chair | table | |
| Ours w/o seg | 12.62 | 16.00 | 20.65 | 13.18 | 15.44 | 19.92 |
| Ours w/ pred vis seg | 8.75 | 11.34 | 15.55 | 8.66 | 11.84 | 15.75 |
| Ours w/ pred full seg | 8.42 | 10.82 | 13.65 | 8.40 | 11.13 | 13.85 |
| Ours w/ gt full seg | 8.24 | 9.21 | 10.50 | 8.18 | 9.66 | 11.44 |
| Category | CD | EMD | ||
|---|---|---|---|---|
| Ours | Ours w/ silhouette | Ours | Ours w/ silhouette | |
| plane | 0.386 | 0.395 | 0.527 | 0.552 |
| bench | 0.436 | 0.450 | 0.815 | 0.826 |
| cabinet | 0.373 | 0.365 | 2.147 | 2.154 |
| car | 0.308 | 0.301 | 1.306 | 1.302 |
| chair | 0.606 | 0.637 | 1.257 | 1.275 |
| monitor | 0.501 | 0.519 | 1.314 | 1.338 |
| lamp | 0.969 | 0.991 | 1.007 | 1.005 |
| speaker | 0.632 | 0.626 | 2.441 | 2.408 |
| firearm | 0.463 | 0.480 | 0.572 | 0.580 |
| couch | 0.439 | 0.455 | 1.536 | 1.568 |
| table | 0.589 | 0.604 | 1.340 | 1.338 |
| cellphone | 0.332 | 0.339 | 0.674 | 0.687 |
| watercraft | 0.478 | 0.492 | 0.730 | 0.747 |
| mean | 0.501 | 0.512 | 1.205 | 1.214 |
| Method | Training data | Slightly Occluded | Highly Occluded | ||||
| sofa | chair | table | sofa | chair | table | ||
| Mask-RCNN | real | 89.10 | 63.57 | 61.02 | 78.79 | 56.87 | 59.56 |
| Ours | syn | 90.18 | 63.88 | 60.09 | 84.54 | 57.55 | 56.83 |
| Ours | syn+real | 90.45 | 63.89 | 66.45 | 86.36 | 57.01 | 60.47 |
| Method | CD | EMD | ||||
| sofa | chair | table | sofa | chair | table | |
| Ours w/o seg | 15.34 | 17.82 | 22.84 | 16.40 | 16.48 | 21.25 |
| Ours w/ pred vis seg | 9.04 | 12.46 | 16.28 | 8.99 | 12.75 | 16.54 |
| Ours w/ pred full seg | 8.58 | 11.83 | 13.92 | 8.49 | 12.00 | 13.96 |
| Ours w/ gt full seg | 8.26 | 9.58 | 8.77 | 7.97 | 9.98 | 9.59 |
| Method | CD | EMD | ||||
| sofa | chair | table | sofa | chair | table | |
| Ours w/o seg | 15.77 | 18.03 | 25.33 | 16.90 | 16.52 | 23.41 |
| Ours w/ pred vis seg | 9.27 | 13.55 | 19.06 | 9.65 | 13.68 | 18.64 |
| Ours w/ pred full seg | 8.85 | 13.76 | 18.19 | 9.19 | 13.53 | 17.93 |
| Ours w/ gt full seg | 8.28 | 10.44 | 13.05 | 8.27 | 10.71 | 12.65 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
Topics3D Shape Modeling and Analysis · Advanced Vision and Imaging · 3D Surveying and Cultural Heritage
Silhouette Guided Point Cloud Reconstruction beyond Occlusion
Chuhang Zou Derek Hoiem
University of Illinois at Urbana-Champaign
{czou4, dhoiem}@illinois.edu
Abstract
One major challenge in 3D reconstruction is to infer the complete shape geometry from partial foreground occlusions. In this paper, we propose a method to reconstruct the complete 3D shape of an object from a single RGB image, with robustness to occlusion. Given the image and a silhouette of the visible region, our approach completes the silhouette of the occluded region and then generates a point cloud. We show improvements for reconstruction of non-occluded and partially occluded objects by providing the predicted complete silhouette as guidance. We also improve state-of-the-art for 3D shape prediction with a 2D reprojection loss from multiple synthetic views and a surface-based smoothing and refinement step. Experiments demonstrate the efficacy of our approach both quantitatively and qualitatively on synthetic and real scene datasets.
1 Introduction
3D reconstruction from 2D images has many applications in robotics and augmented reality. One major challenge is to infer the complete shape of a partially occluded object. Occlusion frequently occurs in natural scenes: e.g. we often see an image of a sofa occluded by a table in front and a dining table partially occluded by a vase on top. Even multi-view approaches [34, 12, 19] may fail to recover complete shape, since occlusions may block most views of the object. Single-view learning-based methods [13, 6, 44] have approached seeing beyond occlusion as a 2D semantic segmentation completion task, but complete 3D shape recovery adds the challenges of predicting 3D shape from a 2D image and being robust to the unknown existence and extent of an occluding region.
In this paper, our goal is to reconstruct a complete 3D shape from a single RGB image, in a way that is robust to occlusions. We follow a data-driven approach, using convolution neural networks (CNNs) to encode shape-relevant features and decode them into an object point cloud. To simplify the shape prediction, we split the task into: (1) determining the visible region of the object; (2) predicting a completed silhouette (filling in any occluded regions); and (3) predicting the object 3D point cloud based on the silhouette and RGB image (Fig. 1). We reconstruct the object in a viewer-centered manner, inferring both object shape and pose. We show that, provided with ground truth silhouettes, shape prediction achieves nearly the same performance for occluded objects as non-occluded objects. We obtain the visible portion of the silhouette using Mask-RCNN [16] and then predict the completed silhouette using an auto-encoder. Using the predicted silhouette as part of shape prediction also yields large improvements for both occluded and non-occluded objects, indicating that providing an explicit foreground/background separation for the object in RGB images is helpful 111Code and data is available at https://github.com/zouchuhang/Silhouette-Guided-3D.
Our reconstruction represents a 3D shape as a set of point clouds, which is flexible and easy to transform. Our method follows an encoder-decoder strategy, and we demonstrate performance gains using a 2D reprojection loss from multiple synthetic views and a surface-based post refinement step, achieving state-of-the-art. Our silhouette guidance approach is related to shape from silhouette [21, 3, 24], but our silhouette guidance is part of learning approach rather than explicit constraint.
Our contributions:
- •
We improve the state-of-the-art for 3D point clouds reconstruction from a single RGB image. We show performance gains by using a 2D reprojection loss on multiple synthetic views and a surface-based refinement step.
- •
We demonstrate that completing the visible silhouette leads to better object shape completion. We propose a silhouette completion network that achieves the state-of-the-art. We show improvements for reconstruction of non-occluded and partially occluded objects.
2 Related Work
Single image 3D shape reconstruction is an active topic of research. Approaches use RGB images [36, 39, 37, 11, 31], depth images [42, 46, 40, 30] or both [14, 9, 15]. Approaches include exemplar based shape retrieval and alignment [2, 1, 14, 17], deformations from meshes [20, 25, 36], or a direct prediction via convolution neural networks [39, 41, 38]. Qi et al. [28] propose a novel deep net architecture suitable for consuming unordered point sets in 3D; Fan et al. [7] propose to generate point clouds from a single RGB image using generative models. More recent approaches improve point set reconstruction performance by learning representative latent features [27] or by imposing constraints of geometric appearance in multiple views [18].
Most of these approaches are applied to non-occluded objects with clean backgrounds and no occlusions, which may prevent their application to natural images. Sun et al. [32] conduct experiments on real images from Pix3D, a large-scale dataset with aligned ground-truth 3D shapes, but do not consider the problem of occlusion. We are concerned with predicting shape of objects in natural scenes, which may be partly occluded. Our approach improves the state-of-the-art for object point set generation, and is extended to reconstruct beyond occlusion with the guidance of completed silhouettes. Our silhouettes guidance is closely related to the human depth estimation by Rematas et al. [29]. However, Rematas et al. use the visible silhouette (semantic segmentation) rather than a complete silhouette, making it hard to predict overlapped (occluded) regions. Differently, our approach conditions on predicted silhouette to resolve occlusion ambiguity, and is able to predict complete 3D shape rather than 2.5D depth points.
Seeing beyond occlusion. Occlusions have long been an obstacle in multi-view reconstruction. Solutions have been proposed to recover portions of surfaces from single views, e.g. with synthetic apertures [35, 8], or to otherwise improve robustness of matching and completion functions from multiple views [34, 12, 19]. Other work decompose a scene into layered depth maps from RGBD [26] images or video [45] and then seek to complete the occluded portions of the maps. But errors in layered segmentation can severely degrade the recovery of the occluded region. Learning-based approaches [13, 6, 44] have posed recovery from occlusion as a 2D semantic segmentation completion task. Ehsani et al. [6] propose to complete the silhouette and texture of an occluded object. Our silhouette completion network is most similar to Ehsani et al., but we ease the task by predicting the complete silhouette rather than the full texture. We demonstrate better performance with our up-sampling based convolution decoder instead of fully connected layers used in Ehsani et al. Moreover, We go further to try to predict the complete 3D shape of the occluded object.
3 Point Clouds from a Single RGB Image
Direct point set prediction from a single image is challenging due to the unknown camera viewpoint or object pose and large intraclass variations in shape. This requires a careful design choice on the network architecture. We aim to have an encoder that can capture object pose and shape from a single image, and a decoder that is flexible in producing unordered, dense point clouds.
In this section, we introduce our point prediction network architecture, the training scheme including a 2D reprojection loss on multiple synthetic views to improve performance (Sec. 3.1). We then introduce a post-refinement step via surface-based fitting (Sec. 3.2) to produce smooth and uniformly distributed point sets.
3.1 Point Cloud Reconstruction Network
Our network architecture is illustrated in Fig. 2. The network predicts 3D point clouds in viewer-centered coordinates. The encoder is based on ResNet-50 [16] to better capture object shape and pose feature. The decoder follows a coarse-to-fine multi-stage generation scheme in order to efficiently predict dense points with limited memory. Our decoder follows the design of PCN [43]. The coarse predictor predicts sparse points. The refinement branch produces 4 finer points, by learning a up-sampling surface grid centered on each coarse point via local folding operation. We experimented with a higher up-sampling rate (e.g. 9, 16) as PCN but observed repetitive patterns across all surface patches, missing local shape details. Note that our network is able to generate a denser prediction with another up-sampling branch on top, but the current structure best balances accuracy and training/inference speed. Our reconstruction network does not require features from partial points like PCN, and produces an on-par performance with PSG [7], a state-of-the-art method in point set generation (see experiments in Sec. 5.3), even without the refinement step we will introduce in Sec. 3.2.
Loss function. We consider the training loss in 3D space using the bidirectional Chamfer distance. Given predicted point clouds and the ground truth , we have:
[TABLE]
To further boost the performance, we propose a 2D reprojection loss on point sets as follows:
[TABLE]
Where is a 2D projection operation from 3D space, with 3D rotation and translation in world coordinates and a known camera intrinsic . Since our reconstruction is viewer-centered, we can simply set assuming projections on the image plane. Our 2D reprojection loss is an unidirectional Chamfer distance; we only penalize the average distance from each projected ground truth point to the nearest projection of predicted point cloud. This is because the Chamfer distance on another direction tends to be redundant. When the predicted point is projected inside the ground truth 2D segmentation, the distance to the nearest projected ground truth points tends to zero, resulting in a small gradient and having less effect for learning better 3D point clouds. Although we project points instead of surfaces or voxel occupancy, producing non-continuous 2D segments, our 2D reprojection loss is computational efficient and shows promising improvements in experiment. Moreover, fitting a surface for post-refinement (Sec. 3.2) to these points is effective.
We can extend Eq. 2 to project 3D points onto multiple orthographic synthetic views: e.g. projecting to plane, plane (image plane) or plane in world coordinates (detailed illustration in Appx. A). In this case we can simply change the rotation matrix based on each view. Our 2D reprojection loss does not require additional rendered 2D silhouette ground truth of known view points, which makes the training possible on the dataset where the 3D ground truth is available.
When being projected on the plane (image plane), the ground truth is a subset of the ground truth silhouette. We thus use 2D points sampled from the ground truth segmentation mask instead of projecting ground truth 3D points :
[TABLE]
The overall loss function is shown below:
[TABLE]
which is the weighted summation over the 3D Chamfer loss and the 2D reprojection losses. Here is the 2D reprojection loss on the image plane, and is the projection on planes. Note that different from PCN, our network only penalizes on the finest output, which helps ease training and shows no performance degrades.
Implementation details. Our network gets as input a image with pixel values normalized to . The bottleneck feature size is 1024. The coarse decoder consists two fc layers, with feature size of 1024 and 3072 and ReLU in between. We set the surface grid for point up-sampling to be zero-centered with a side length of 0.1. We use the ResNet encoder pre-trained from ImageNet and apply a stage-wise training scheme for faster convergence and easier training: first train to predict coarse point cloud, fix the trained layers, then train the up-sampling header, and finally train the whole network end-to-end. We use ADAM [23] to update network parameters with a learning rate of and and batch size 32. We set , and in Eq. 4 based on grid search in the validation set.
Data augmentation. We augment the training samples by gamma correction with between 0.5-2. We re-scale image intensity with a minimum intensity ranges between 0-127 and a fixed maximum intensity of 255. We add color jittering to each RGB channel independently by multiplying a factor ranges in 0.8-1.2. Each augmentation parameter is uniformly and randomly sampled from the defined range.
3.2 Surface-based Point Clouds Refinement
One important 3D shape property is the smooth and continuous shape surfaces, especially for thin structures like chair legs and light stands. To impose this property, we perform a post-refinement step (Fig. 3), fitting surfaces from dense points, smoothing surfaces and uniformly re-sampling points from the surfaces again. Our surface fitting method is based on Floating Scale Surface Reconstruction (FSSR) [10], which is the state-of-the-art for surface-based reconstruction from dense point clouds. We set the parameter of point-wise normal by plane fitting on 6 neareast neighbors, and set the per-point scale as the average distance to the two closest points. A mesh cleaning step goes after FSSR to remove small and redundant patches. We also experimented with Poisson surface reconstruction [22], but FSSR produces a better surface in our case.
Smoothing. Given the fitted surfaces, we perform smoothing by implicit integration method [5]. We use curvature flow as the smoothing operator for 5 iterations. We then uniformly sample point clouds based on Poisson disc sampling to obtain our final output.
Our surface fitting, smoothing, and re-sampling enables production of evenly distributed points that can model thin structures and smooth surfaces, but the predicted shape may not be closed, preventing volumetric-based evaluation. Also, small and disconnected sections of points that model details may be lost, and points that are incorrectly connected to the mesh surfaces may increase errors in some area. Overall, though, our proposed post-processing refinement step improves the performance ( large gain in earth-mover’s distance with a small cost to Chamfer distance).
4 Reconstruction of Occluded Objects
So far, our proposed point cloud reconstruction approach does not consider foreground occlusions: such as a table in front of a sofa, or a person or pillows on the sofa. Standard approaches do not handle occlusions well since the model does not know whether or where an object is occluded. Starting with an RGB image and an initial silhouette of the visible region, which can be acquired by recent approaches such as Mask-RCNN [16], we propose a 2D silhouette completion approach to generate the complete silhouette of the object. We show that prediction based on the true completed silhouette greatly improves shape prediction and brings performance on occluded objects close to non-occluded; using predicted silhouettes also improves performance, with completed silhouettes outperforming predicted silhouettes of the visible portion.
4.1 Silhouette Completion Network
We assume a detected object, its RGB image crop and the segmentation of the visible region . Our silhouette completion network (Fig. 2) predicts the complete 2D silhouette of the object based on . The network follows an encoder-decoder strategy, gets as input the concatenation of and , and predicts with the same resolution as . The encoder is a modified ResNet-50 and the decoder consists 5 up-sampling layers, producing a single channel silhouette. Intuitively, when occlusion occurs, we want the network to complete silhouette, and when no occlusion occurs, we want the network to predict the original segmentation. We add skip connections to obtain this property. We concatenate the feature after the conv layer of the encoder to the input feature layer of the decoder layer. A full skip connection helps ease training and produces the best results.
Implementation details. For each input image and visible silhouette , we resize them to with preserved aspect ratio and white pixel padded. The value of is re-scaled to and is a binary map with 1 indicating the object. For the encoder, we remove the top fc layer and the average pooling layer of a pre-trained ResNet on ImageNet and obtain a bottleneck feature of . The decoder applies nearest neighbor up-sampling with a scale factor of 2, followed by a convolution layer (kernel size , stride 1) and ReLU. The decoder feature sizes are 2048, 1024, 512, 256, 64 and 1 respectively, with a Sigmoid operation on top. We train the network using binary cross entropy loss between the prediction and the ground truth complete silhouette. We use ADAM to update network parameters with a learning rate of and and batch size 32. Our final prediction is a binary mask obtained by a threshold of 0.5. To account for the truncation of the full object due to the unknown extent of occluded region, we expand the bounding box around each object by 0.3 on each side.
Data augmentation. For each training sample, we perform random cropping on the input image. We crop with a uniformly and randomly sampled ratio ranges between on each side of the input image. Other augmentations include left-right flipping with 50% probability and random rotation uniformly sampled within degrees on the image plane. We also perform image gamma correction, intensity changes and color jittering as in Sec. 3.1.
4.2 Silhouette Guided 3D Reconstruction
Given the predicted complete silhouette , we modify our point cloud reconstruction network to be robust to occlusion. We concatenate our predicted complete silhouette as an additional input channel to the input RGB image , which can effectively guide reconstruction for both partially occluded and non-occluded objects. We show in experiments the importance of using silhouette compared to the approach with no silhouette guidance at all.
Synthetic occlusion dataset. Since there is no large-scale 3D dataset of rendered 2D object image with occlusion and the complete silhouette ground truth available, we propose to generate a synthetic occlusion dataset. Instead of off-line rendering samples which is time consuming and has limited variety of occlusion, we propose a “cut-and-paste” strategy to create random foreground occlusion. Starting with a set of pre-rendered 2D images without occlusion and with known ground truth silhouettes, for each input image , we randomly select another image from the same split (train/val/test) of the dataset as , cutting out the object segment from , pasting and overlaying on the input image . To be more specific, we paste on the location uniformly sampled from , where denotes the top left and bottom right position of the bounding box around the object segment in . are the height and width of the pasted segment which is considered as foreground occlusion. To ease training, we exclude input samples with pasted occlusion covering over 50% of the complete object segment. We perform “cut-and-paste” with 50% probability in training and further add randomly sampled real-scene background with 50% probability. We use the same data augmentation as in Sec. 3.1 to train the network. We penalize the network on the ground truth complete 3D point clouds and the 2D reprojection loss assuming full shape 2D reprojection. We use the ground truth visible and complete silhouettes to train the network.
5 Experiments
5.1 Setup
We verify the following three aspects of our proposed framework: (1) the performance of our reconstruction network compared with the state-of-the-art, the positive impact of surface refinement and reprojection loss (Sec. 5.3); (2) the performance of our silhouette completion network compared with the state-of-the-art (Sec. 5.4); (3) the impact of silhouette guidance with robustness to occlusion (Sec. 5.5). Please find more results in the appendix.
We use ShapeNet to evaluate the overall reconstruction approach. However, since ShapeNet features already-segmented objects, it is not suitable for evaluating silhouette guidance. For that, we use the Pix3D dataset [32], which contains real images of occluded objects with aligned 3D shape ground truth. We consider the following two standard metrics for 3D point cloud reconstruction:
Chamfer distance (CD), defined in Eq. 3.1. 2. 2.
Earth mover’s distance (EMD), defined as follows:
[TABLE]
where denotes a bijection that minimizes the average distance between corresponding points. Since is expensive to compute, we follow the approximation solution as in Fan et al. [7].
For silhouette completion, we evaluate our approach on the synthetic DYCE dataset [6] and compare with the state-of-the-art. DYCE contains segmentation mask for both visible and occluded region in photo-realistic indoor scenes. We also report performance on the Pix3D real dataset since we perform silhouette guided reconstruction on Pix3D. We use the evaluation metric of 2D IoU between the predicted and the ground truth complete silhouette. For detailed comparison, we consider the 2D IoU of the visible potion, the invisible portion and the complete silhouette.
To verify the impact of silhouette guidance, we compare with the following three baselines:
Without silhouette guidance (ours w/o seg); 2. 2.
Guided by visible silhouette (ours w/ vis seg). We use the predicted visible silhouette (semantic segmentation by Mask-RCNN) to guide reconstruction; 3. 3.
Guided by ground truth complete silhouette (ours w/ gt full seg). This shows the upper bound performance of our approach.
5.2 Implementation Details
We implement our network using PyTorch, train and test on a single NVIDIA Titan X GPU with 12GB memory. A single forward pass of the network takes 15.2 ms. The point cloud refinement step is C++ based and takes around 1s on a Linux machine with Intel Xeon 3.5G Hz in CPU mode.
To train our reconstruction approach on ShapeNet, we follow the train-test split defined by Choy et al. [4]. The dataset consists 13 objects classes. Each object has 24 randomly selected views rendered in 2D. We randomly select 10% of the shapes from each class of the train set to form the validation set. The viewer-centered point clouds ground truth is generated by Wang et al. [36]. Since ShapeNet objects are non-occluded, we train our approach without silhouette guidance.
To train the network with silhouette guidance, we construct a synthetic occlusion dataset (Sec. 4.2) based on ShapeNet and add real background from LSUN dataset. We fine-tune the network we previously trained on ShapeNet on our generated synthetic occlusion data. We train the baseline network having predicted visible silhouette as input with the ground truth visible silhouette generated from ShapeNet. We train the baseline network having no silhouette guidance with RGB images as input only. All configurations use the same training settings. It’s worth noting that we do not train reconstructions on Pix3D in any of our experiments to avoid the classification/retrieval problem as discussed by Tatarchenko et al. [33].
For silhouette completion, since Pix3D has fewer images, we first train our approach on the synthetic DYCE dataset. We use DYCE’s official train-val split and use ground truth visible 2D silhouette as input. On Pix3D, we use Mask-RCNN to detect and segment the visible silhouette of the object in each image, then perform completion upon the segmentation. We obtain valid detections with correctly detected object class and a 2D IoU compared to the ground truth bounding box. We use 5-fold cross validation to fine-tune the silhouette completion network pre-trained on DYCE. We further split out 10% val data from each train split in each fold to tune network parameters. Note that images in Pix3D with the same 3D ground truth model are in the same split of either train, val or test.
5.3 Single Image 3D Reconstruction without Occlusion
We show in Fig. 4 sample qualitative results of our reconstructions on ShapeNet. We report in Tab. 1 our quantitative comparison with the state-of-the-art viewer-centered reconstruction approaches. PSG [7] generates point clouds and Pixel2Mesh [36] produces meshes. We sample the same 2466 points as PSG and Pixel2Mesh for fair comparison. Our approach outperforms the two methods on both metrics.
Ablation study. We show in Tab. 2 our performance with different configurations: without both surface-based refinement step and 2D reprojection loss, without refinement step only, without 2D reprojection loss only and our full approach. We observe the improvement with 2D reprojection loss on both CD and EMD. The improvement with 2D loss is less significant when we have the surface-based refinement step, showing the refinement step mitigates some problems with point cloud quality that are otherwise remedied by training with the reprojection loss. The refinement step increases Chamfer distance, mainly due to the smoothed out small and sparse point pieces that model the thin and complex shape structure like railings or handles (Fig. 4, 1st row, 2nd column), and the enhanced error point predictions by connecting sparse point sets (Fig. 4, 3rd row, 2nd column). It’s also worth noting that, even without the post-refinement step, our approach achieves a better CD than PSG and a slightly worse EMD, proving the superiority of our network architecture.
Comparison with object-centered approach. Tab. 3 shows our comparison with the state-of-the-art object-centered point cloud reconstruction approach: 3D-LMNet [27] on ShapeNet. We follow the evaluation procedure as 3D-LMNet, sample 1024 points and re-scale our prediction (and ground truth) to be zero-centered and unit length of 1, then perform ICP to fit to the ground truth. Although our approach targets the more difficult task of joint shape prediction and view point estimation, we achieve a much better EMD and only a slightly worse CD. Note that the reported CD and EMD are of different scales compared to that in Tab. 1, this is because 3D-LMNet takes a squared value when computing CD and the ground truth is re-sized.
5.4 Silhouette Completion
Tab. 4 shows the comparison with the state-of-the-art silhouette completion approach SeGAN [6] on DYCE test set. SeGAN uses ResNet-18 encoder, our approach with ResNet-18 outperforms SeGAN, due to our better up-sampling based decoder that predicts better region beyond occlusion, and the skip connection architecture that preserves the visible region. We use ResNet-50 encoder which yields the best performance to complete silhouettes for the downstream reconstruction task.
Tab. 5 reports our silhouette completion performance on Pix3D. Our approach achieves a better performance than the Mask-RCNN baseline (visible region). We show better performance by fine-tuning our completion approach on Pix3D (“syn+real” v.s. “syn” for training data).
5.5 Robustness to Occlusion
In Fig. 5 we show our qualitative performance on Pix3D for the three object classes that co-occur in ShapeNet. Tab. 6 and Tab. 7 present our quantitative performance of reconstructing occluded objects and non-occluded objects respectively. For evaluation, we use the ground truth point cloud provided by Mandikal et al. [27] and sample the same number of points from our approach for evaluation. Since Pix3D evaluates object-centered reconstruction, we rotate each ground truth shape to have the viewer-centered orientation w.r.t. camera. We then re-scale both ground truth and our prediction to be zero-centered and unit-length, and perform ICP with translation only. We follow the officially provided evaluation metrics of Chamfer distance and Earth mover’s distance by Sun et al. [32]. We compare the performance of our silhouette guided reconstruction “Ours w/ pred full seg” with three baselines: without silhouette guidance, guided with predicted visible silhouette from Mask-RCNN, and guided with ground truth complete silhouette. Compared to the qualitative results on ShapeNet (Fig. 4), we see the challenges of 3D reconstruction given real images due to occlusion and complex background. Using ground truth complete silhouette can make the network be robust to occlusion. Without silhouette guidance, the prediction is difficult because the network does not know whether or where an object is occluded and what to reconstruct; and the network faces the challenges of synthetic to real, since our reconstruction network is only trained on synthetic dataset. Our proposed silhouette guidance is able to bridges the gap between synthetic and real, referring to the large performance boosts from “Ours w/o seg” to other rows. With the guidance of predicted complete silhouettes, our approach is able to narrow down the performance gap between occluded and non-occluded objects, and outperforms the approach with predicted visible silhouettes.
6 Conclusion
We propose a method to reconstruct the complete 3D shape of an object from a single RGB image, with robustness to occlusion. Our point cloud reconstruction approach achieves the state-of-the-art with the major improvement by the surface-based refinement step. We show that, when provided with input ground truth silhouettes, the shape prediction performance is nearly as good for occluded as for non-occluded objects. Using the predicted silhouette also yields large improvements for both occluded and non-occluded objects, indicating that providing an explicit foreground/background separation for the object in RGB images is helpful.
Acknowledgements
This research is supported in part by NSF award 14-21521 and ONR MURI grant N00014-16-1-2007.
Appendix A Illustration of 2D Reprojection Loss
We show in Fig. 6 a detailed description of our 2D reprojection loss from multiple views. For each predicted object, we consider three orthogonal projections based on right-handed coordinates: projection on the y-z plane, projection on the x-z plane and projection on the x-y plane. Note that the projection on the y-z plane is equal to the projection on the image plane (frontal view).
Appendix B Reconstruction Performance with Silhouette Guidance on ShapeNet
Tab. 8 shows our reconstruction performance with silhouette guidance on ShapeNet. Since ShapeNet images [4] use white background and have no foreground occlusion, we obtain silhouettes with image intensity (given image value ranges between ) for both training and testing. With silhouette guidance the overall performance drops slightly. This is because the input image with a white background provides sufficient evidence for the network to capture 3D shape feature, and due to the ambiguity of similar silhouettes from completely different shapes, additional silhouette guidance will confuse the network.
Appendix C Detailed Quantitative Analysis on Pix3D
We evaluate the performance of reconstructing occluded objects under different occlusion ratio. Pix3D [32] has detailed ground truth label of “slightly occluded” and “highly occluded”. We report reconstruction performance classified by these two labels as follows.
Silhouette completion. Tab. 9 reports the performance of completing slightly occluded and highly occluded silhouettes respectively. Completion is easier for slightly occluded objects than highly occluded objects. Our silhouette completion network fine-tuned on Pix3D shows better performance than Mask-RCNN baseline. For sofas, our approach has similar performance for slightly occluded and highly occluded objects. For chairs, completion is difficult due to the complex shape variations beyond occlusions, but we still slightly outperform Mask-RCNN. For tables, our approach is able to have much better completion compared to Mask-RCNN baseline for slightly occluded objects than highly occluded objects. This is because the visible segmentation obtained by Mask-RCNN is not accurate, making silhouette completion difficult.
Silhouette guided point clouds reconstruction. Tab. 10 and Tab. 11 show quantitative results for reconstructing slightly occluded and highly occluded objects in the Pix3D dataset respectively. For all methods, reconstructing slightly occluded chairs and tables is easier than reconstructing highly occluded chairs and tables. For sofas, reconstruction performance is less affected by the occlusion ratio. Overall, our method with predicted complete silhouette performs better than our method guided by visible silhouettes, except for highly occluded chairs that are difficult to predict due to their complex shape beyond occlusion.
Appendix D More Qualitative Results on Pix3D
We show in Fig. 7 and Fig. 8 more qualitative results on Pix3D dataset.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] M. Aubry, D. Maturana, A. A. Efros, B. C. Russell, and J. Sivic. Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 3762–3769, 2014.
- 2[2] A. Bansal, B. Russell, and A. Gupta. Marr revisited: 2d-3d alignment via surface normal prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 5965–5974, 2016.
- 3[3] J. T. Barron and J. Malik. Color constancy, intrinsic images, and shape estimation. In European Conference on Computer Vision , pages 57–70. Springer, 2012.
- 4[4] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3d-r 2n 2: A unified approach for single and multi-view 3d object reconstruction. In Proceedings of the European Conference on Computer Vision (ECCV) , 2016.
- 5[5] M. Desbrun, M. Meyer, P. Schröder, and A. H. Barr. Implicit fairing of irregular meshes using diffusion and curvature flow. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques , pages 317–324. Citeseer, 1999.
- 6[6] K. Ehsani, R. Mottaghi, and A. Farhadi. Segan: Segmenting and generating the invisible. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 6144–6153, 2018.
- 7[7] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 605–613, 2017.
- 8[8] P. Favaro and S. Soatto. Seeing beyond occlusions (and other marvels of a finite lens aperture). In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings. , volume 2, pages II–II. IEEE, 2003.
