TAPA-MVS: Textureless-Aware PAtchMatch Multi-View Stereo
Andrea Romanoni, Matteo Matteucci

TL;DR
This paper introduces TAPA-MVS, a multi-view stereo method that improves depth estimation in untextured areas by generating novel hypotheses and adapting photo-consistency measures, resulting in more complete 3D reconstructions.
Contribution
It proposes a textureless-aware PatchMatch approach that enhances depth estimation in untextured regions and refines depth maps for better completeness and accuracy.
Findings
Improved reconstruction completeness in untextured areas.
Enhanced accuracy of depth and normal maps.
Outperforms several state-of-the-art algorithms on the ETH3D dataset.

| Method | Training: Overall | Training: Low-Res | Training: High-Res | Test: Overall | Test: Low-Res | Test: High-Res |
|---|---|---|---|---|---|---|
| TAPA-MVS (Proposed) | 71.42 | 55.13 | 77.69 | 73.13 | 58.67 | 79.15 |
| OpenMVS | 70.44 | 55.58 | 76.15 | 72.83 | 56.18 | 79.77 |
| ACMH [27] | 65.37 | 51.50 | 70.71 | 67.68 | 47.97 | 75.89 |
| COLMAP [17] | 62.73 | 49.91 | 67.66 | 66.92 | 52.32 | 73.01 |
| LTVRE [10] | 59.44 | 53.25 | 61.82 | 69.57 | 53.52 | 76.25 |
| CMPMVS [8] | 47.48 | 9.53 | 62.49 | 51.72 | 7.38 | 70.19 |

| Tolerance (cm) | COLMAP [17] C | COLMAP [17] A | COLMAP [17] F1 | w/o TW C | w/o TW A | w/o TW F1 | w/o CS C | w/o CS A | w/o CS F1 | w/o FS C | w/o FS A | w/o FS F1 | w/o DR C | w/o DR A | w/o DR F1 | TAPA-MVS C | TAPA-MVS A | TAPA-MVS F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 38.65 | 84.34 | 51.99 | 32.68 | 74.40 | 44.58 | 41.72 | 75.30 | 53.18 | 41.35 | 75.10 | 52.86 | 47.78 | 72.13 | 56.31 | 51.66 | 75.37 | 60.85 |
| 2 | 55.13 | 91.85 | 67.66 | 52.57 | 85.70 | 63.08 | 64.13 | 85.98 | 72.54 | 63.69 | 85.77 | 72.26 | 64.27 | 83.32 | 71.84 | 71.45 | 85.88 | 77.69 |
| 5 | 69.91 | 97.09 | 80.50 | 69.31 | 94.08 | 78.62 | 81.08 | 93.69 | 86.68 | 80.84 | 93.58 | 86.51 | 78.62 | 92.51 | 84.37 | 84.83 | 94.31 | 88.91 |
| 10 | 79.47 | 98.75 | 87.61 | 78.10 | 96.91 | 85.64 | 88.80 | 96.53 | 92.38 | 88.61 | 96.45 | 92.22 | 86.33 | 95.94 | 90.47 | 90.98 | 96.79 | 93.69 |
| 20 | 88.24 | 99.37 | 93.27 | 84.93 | 98.34 | 90.53 | 93.64 | 98.12 | 95.77 | 93.61 | 98.05 | 95.72 | 91.26 | 97.75 | 94.25 | 94.72 | 98.23 | 96.38 |
| 50 | 96.03 | 99.70 | 97.78 | 92.07 | 99.30 | 95.19 | 97.33 | 99.23 | 98.25 | 97.54 | 99.20 | 98.34 | 95.65 | 99.21 | 97.23 | 97.60 | 99.30 | 98.41 |

C = completeness, A = accuracy.
TAPA-MVS: Textureless-Aware PAtchMatch Multi-View Stereo
Andrea Romanoni
Politecnico di Milano, Italy
Matteo Matteucci
Politecnico di Milano, Italy
Abstract
One of the most successful approaches in Multi-View Stereo estimates a depth map and a normal map for each view via PatchMatch-based optimization and fuses them into a consistent 3D point cloud. This approach relies on photo-consistency to evaluate the goodness of a depth estimate. It generally produces very accurate results; however, the reconstructed model often lacks completeness, especially in broad untextured areas where the photo-consistency metrics are unreliable. Assuming that the untextured areas are piecewise planar, in this paper we generate novel PatchMatch hypotheses so as to expand reliable depth estimates into neighboring untextured regions. At the same time, we modify the photo-consistency measure so as to favor standard or novel PatchMatch depth hypotheses depending on the textureness of the considered area. We also propose a depth refinement step to filter wrong estimates and to fill the gaps in both the depth maps and the normal maps while preserving the discontinuities. The effectiveness of our new methods has been tested against several state-of-the-art algorithms on the publicly available ETH3D dataset, which contains a wide variety of high- and low-resolution images.
1 Introduction
Multi-View Stereo (MVS) aims at recovering a dense 3D representation of the scene perceived by a set of calibrated images, for instance, to map cities, to create a digital library of cultural heritage, or to help robots navigate an environment. Thanks to the availability of public datasets [20, 23, 9], several successful MVS algorithms have been proposed in the last decade, and their performance keeps increasing.
Depth map estimation represents one of the fundamental and most challenging steps on which most MVS methods rely. Depth maps are then fused together directly into a point cloud [29, 17], or into a volumetric representation, such as a voxel grid [16, 3] or Delaunay triangulation [11, 25, 10, 14]. In the latter case a 3D mesh is extracted and can be further refined via variational methods [25, 2, 13] and eventually labelled with semantics [15].
Although Machine Learning methods have begun to appear [7, 26, 28], PatchMatch-based algorithms, which emerged some years ago, are still the top-performing approaches for efficient and accurate depth map estimation. The core idea of PatchMatch, pioneered by Barnes et al. [1] and extended to depth estimation by Bleyer et al. [4], is to choose for each pixel a random guess of the depth and then propagate the most likely estimates to its neighborhood. Starting from this idea, Schönberger et al. [17] recently proposed a robust framework able to jointly estimate the depth, the normals, and the pixel-wise camera visibility for each view.
One of the major drawbacks of PatchMatch methods is that most untextured regions are not handled correctly (Figure 1(b)). Indeed, the optimization relies heavily on the photometric measure to discriminate which random estimate is the best guess and to filter out unstable estimates. The depth of untextured regions is hard to estimate with enough confidence: since these regions are homogeneous, the photometric measure alone can hardly tell neighboring regions apart.
In this paper, we specifically address the drawback of untextured regions by leveraging the assumption that they are often piecewise flat (Figure 1(d)). The proposed framework, named TAPA-MVS, introduces:
- a metric to define the textureness of each image pixel, which serves as a proxy for how reliable the photo-consistency metric is;
- a subdivision of the image into superpixels and, for each iteration of the optimization procedure, the fitting of one plane per superpixel; for each pixel, a new depth-normal hypothesis is added and evaluated within the optimization framework, taking into account the likelihood of the plane fitting procedure;
- a novel depth refinement method that filters the depth and normal maps and fills each missing estimate with an approximate bilateral weighted median of its neighbors.
We tested our proposals on the 38 sequences of the publicly available ETH3D dataset [18] (Section 6), and the results show that our method significantly improves the completeness of the reconstruction while preserving very good accuracy.
In the following, after a brief introduction to PatchMatch-based methods (Section 2), we review the COLMAP framework by Schönberger et al. [17] (Section 3). Sections 4 and 5 describe the proposed texture-aware PatchMatch hypotheses generation and the depth map refinement. Section 6 illustrates the experimental results.
2 Patch-Match for Multi-View Stereo
The PatchMatch seminal paper by Barnes et al. [1] proposed a general method to efficiently compute an approximate nearest neighbor function defining the pixelwise correspondence among patches of two images. The idea is to use a collaborative search which exploits local coherency. PatchMatch initializes each pixel of an image with a random guess about the location of the nearest neighbor in the second image. Then, each pixel propagates its estimate to the neighboring pixels and, among these estimates, the most likely is assigned to the pixel itself. As a result the best estimates spread along the entire image.
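The random initialization and propagation loop described above can be illustrated with a toy 1D sketch. It is not the algorithm of [1]: the 1D signals, the absolute-difference patch cost, and the names `patch_cost` and `patchmatch_1d` are assumptions made purely for illustration.

```python
import random

def patch_cost(a, b, i, j, radius=2):
    """Sum of absolute differences between the patch of `a` centred at i
    and the patch of `b` centred at j (indices are clamped at the borders)."""
    total = 0.0
    for k in range(-radius, radius + 1):
        ia = min(max(i + k, 0), len(a) - 1)
        jb = min(max(j + k, 0), len(b) - 1)
        total += abs(a[ia] - b[jb])
    return total

def patchmatch_1d(a, b, iterations=4, seed=0):
    """Return (matches, costs): for each index of `a`, an approximate
    nearest-neighbour index in `b` and the cost of that match."""
    rng = random.Random(seed)
    n, m = len(a), len(b)
    # 1) Random initialisation: every pixel guesses a match.
    nnf = [rng.randrange(m) for _ in range(n)]
    cost = [patch_cost(a, b, i, nnf[i]) for i in range(n)]
    for it in range(iterations):
        # Alternate the scan direction so good guesses spread both ways.
        forward = it % 2 == 0
        order = range(n) if forward else range(n - 1, -1, -1)
        step = -1 if forward else 1  # offset of the neighbour already visited
        for i in order:
            # 2) Propagation: try the neighbour's match, shifted by one.
            p = i + step
            if 0 <= p < n:
                cand = min(max(nnf[p] - step, 0), m - 1)
                c = patch_cost(a, b, i, cand)
                if c < cost[i]:
                    nnf[i], cost[i] = cand, c
            # 3) Random search around the current best, halving the window.
            window = m
            while window >= 1:
                lo = max(nnf[i] - window, 0)
                hi = min(nnf[i] + window, m - 1)
                cand = rng.randint(lo, hi)
                c = patch_cost(a, b, i, cand)
                if c < cost[i]:
                    nnf[i], cost[i] = cand, c
                window //= 2
    return nnf, cost
```

A candidate replaces the current match only when it lowers the cost, so the total cost can never increase from one sweep to the next; this is what lets good guesses spread along the signal.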
Bleyer et al. [4] re-framed this method in the stereo matching realm. Indeed, for each image patch, stereo matching looks in the second image for the corresponding patch, i.e., the nearest neighbor in the sense of photometric consistency. To improve robustness, the matching function is not limited to fixed-size square windows: the method extends PatchMatch to estimate a pixel-wise plane orientation, which defines the matching procedure on slanted support windows. Heise et al. [6] integrated PatchMatch for stereo into a variational formulation to regularize the estimate with quadratic relaxation. This approach produces smoother depth estimates while preserving edge discontinuities.
The previous works successfully applied the PatchMatch idea to the pair-wise stereo matching problem. The natural extension to Multi-View Stereo was proposed by Shen [22]. Here the author selects a subset of camera pairs depending on the number of shared points computed by Structure from Motion and their mutual parallax angle. Then he estimates a depth map for the selected subset of camera pairs through a simplified version of the method of Bleyer et al. [4]. The algorithm refines the depth maps by enforcing consistency among multiple views, and it finally merges the depth maps into a point cloud.
A different multi-view approach by Galliani et al. [5] modifies the PatchMatch propagation scheme in such a way that the computation can better exploit the parallelism of GPUs. Differently from Shen [22], they aggregate, for each reference camera, a set of matching costs computed from different source images. One of the major drawbacks of these approaches is that depth estimation and camera pair selection are decoupled. Xu and Tao [27] recently proposed an attempt to overcome this issue; they extended [5] with a more efficient propagation pattern and, in particular, their optimization procedure jointly considers all the views and all the depth hypotheses.
Rather than considering the whole set of images to compute the matching costs, Zheng et al. [29] proposed an elegant method to deal with view selection. They designed a robust method framing the joint depth estimation and pixel-wise view selection problem into a variational approximation framework. Following a generalized Expectation Maximization paradigm, they alternate depth update with a PatchMatch propagation scheme, keeping the view selection fixed, and pixel-wise view inference with the forward-backward algorithm, keeping the depth fixed.
Schönberger et al. [17] extended this method to jointly estimate per-pixel depths and normals, such that, differently from [29], the knowledge of the normals enables slanted support windows to avoid the fronto-parallel assumption. Then they add view-dependent priors to select views that more likely induce robust matching cost computation.
The PatchMatch-based methods described thus far have proven to be among the top-performing approaches in several MVS benchmarks [21, 23, 9, 19]. However, some issues are still open. In particular, most of them rely strongly on photo-consistency measures to discriminate among depth hypotheses. Even if this works remarkably well for textured areas and the propagation scheme partially induces smoothness, untextured regions are often poorly reconstructed. For this reason, we propose two proxies to improve the reconstruction where untextured areas appear. On the one hand, we seamlessly extend the probabilistic framework to explicitly detect and handle untextured regions by extending the set of PatchMatch hypotheses. On the other hand, we complete the depth estimation with a refinement procedure to fill the missing depth estimates.
3 Review of the COLMAP framework
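In this section we review the state-of-the-art framework proposed by Schönberger et al. [17], which builds on top of the method presented by Zheng et al. [29]; the notation below follows [17]. Note that, in the following, we express the coordinate of a pixel with a single value $l$, since both frameworks sweep every single line of the image independently, alternating between rows and columns.
Given a reference image $X^{ref}$ and a set of source images $X^{src} = \{X^m \mid m = 1, \dots, M\}$, the framework estimates the depth $\theta_l$ and the normal $n_l$ of each pixel $l$, together with a binary variable $Z_l^m \in \{0, 1\}$, which indicates whether $l$ is visible in image $m$. This is framed as a Maximum-A-Posteriori (MAP) estimation where the posterior probability is:
$$P(Z, \theta, N \mid X) = \frac{P(Z, \theta, N, X)}{P(X)} = \frac{1}{P(X)} \prod_{l=1}^{L} \prod_{m=1}^{M} \left[ P(Z_{l,t}^m \mid Z_{l-1,t}^m, Z_{l,t-1}^m)\, P(X_l^m \mid Z_l^m, \theta_l, n_l)\, P(\theta_l, n_l \mid \theta_l^m, n_l^m) \right] \tag{1}$$
where $L$ is the number of pixels considered in the current line sweep, $X = \{X^{ref}, X^{src}\}$, and $N = \{n_l \mid l = 1, \dots, L\}$. The likelihood term
$$P(X_l^m \mid Z_l^m, \theta_l, n_l) = \begin{cases} \dfrac{1}{NA} \exp\left(-\dfrac{(1 - \rho_l^m(\theta_l, n_l))^2}{2\sigma_\rho^2}\right) & \text{if } Z_l^m = 1 \\[2ex] \dfrac{1}{N}\,\mathcal{U} & \text{if } Z_l^m = 0 \end{cases} \tag{2}$$
represents the photometric consistency of the patch $X_l^m$, which belongs to a non-occluded source image $m$ and is centered on the pixel corresponding to the point at $l$, with respect to the patch $X_l^{ref}$ around $l$ in the reference image. The photometric consistency $\rho$ is computed as a bilaterally weighted NCC, $A = \int_{-1}^{1} \exp\left(-\frac{(1 - \rho)^2}{2\sigma_\rho^2}\right) d\rho$, and the constant $N$ cancels out in the optimization. The likelihood term $P(\theta_l, n_l \mid \theta_l^m, n_l^m)$ represents the geometric consistency and enforces multi-view depth and normal coherence. Finally, $P(Z_{l,t}^m \mid Z_{l-1,t}^m, Z_{l,t-1}^m)$ favors image occlusion indicators that are smooth both spatially and along the successive iterations of the optimization procedure.
Since Equation (1) is intractable, Zheng et al. [29] proposed to use variational inference to approximate the real posterior with a function $q(Z, \theta, N)$ such that the KL divergence between the two functions is minimized. Schönberger et al. [17] factorize $q(Z, \theta, N) = q(Z)\, q(\theta, N)$ and, to estimate such an approximation, they propose a variant of the Generalized Expectation-Maximization algorithm [12]. In the E step, the values $(\theta, N)$ are kept fixed and, in the resulting Hidden Markov Model, the function $q(Z_{l,t}^m)$ is computed by means of message passing. In the M step, vice versa, the values of $Z_{l,t}^m$ are fixed and the function $q(\theta, N)$ is constrained to the family of Kronecker delta functions $q(\theta_l, n_l) = \delta(\theta_l = \theta_l^*, n_l = n_l^*)$. The new optimal values of $\theta_l$ and $n_l$ are computed as:
$$(\hat{\theta}_l^{opt}, \hat{n}_l^{opt}) = \operatorname*{argmin}_{\theta_l^*, n_l^*} \frac{1}{|S|} \sum_{m \in S} \left(1 - \rho_l^m(\theta_l^*, n_l^*)\right) \tag{3}$$
where $S$ is a subset of source images, randomly sampled according to a probability $P_l(m)$. The probability $P_l(m)$ favors images that are not occluded and that agree with three priors, which encourage good inter-camera parallax, similar resolution, and a camera front-facing the 3D point defined by $(\theta_l, n_l)$.
According to the PatchMatch scheme proposed in [17], the pair $(\theta_l^*, n_l^*)$ evaluated in Equation (3) is chosen among the following set of hypotheses:
$$\left\{ (\theta_l, n_l),\ (\theta_{l-1}^{prp}, n_{l-1}),\ (\theta_l^{rnd}, n_l),\ (\theta_l, n_l^{rnd}),\ (\theta_l^{rnd}, n_l^{rnd}),\ (\theta_l^{prt}, n_l),\ (\theta_l, n_l^{prt}) \right\} \tag{4}$$
where $(\theta_l, n_l)$ comes from the previous iteration, $(\theta_{l-1}^{prp}, n_{l-1})$ is the estimate from the previous pixel of the scan, $(\theta_l^{rnd}, n_l^{rnd})$ is a random hypothesis and, finally, $\theta_l^{prt}$ and $n_l^{prt}$ are two small perturbations of the estimates $\theta_l$ and $n_l$.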
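The hypothesis generation and selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [17]: the seven depth/normal combinations, the depth range, the perturbation size `jitter`, the cost (one minus NCC, averaged over the sampled views), and the callback `ncc_per_view` are all hypothetical choices made for the example.

```python
import math
import random

def normalize(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def random_unit_vector(rng):
    """Draw a random direction by normalising three Gaussian samples."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        if any(abs(c) > 1e-9 for c in v):
            return normalize(v)

def candidate_hypotheses(current, previous, rng,
                         depth_range=(0.1, 10.0), jitter=0.05):
    """Build (depth, normal) candidates for one pixel.
    `current` is the estimate from the previous iteration, `previous` the
    estimate propagated from the previous pixel of the sweep."""
    d_cur, n_cur = current
    d_prp, n_prp = previous
    d_rnd = rng.uniform(*depth_range)             # random depth
    n_rnd = random_unit_vector(rng)               # random normal
    d_prt = d_cur * (1.0 + rng.uniform(-jitter, jitter))   # perturbed depth
    n_prt = normalize([c + rng.uniform(-jitter, jitter) for c in n_cur])
    return [
        (d_cur, n_cur), (d_prp, n_prp),
        (d_rnd, n_cur), (d_cur, n_rnd), (d_rnd, n_rnd),
        (d_prt, n_cur), (d_cur, n_prt),
    ]

def select_best(hypotheses, ncc_per_view):
    """Pick the hypothesis minimising the mean of (1 - NCC) over the views.
    `ncc_per_view(h)` is a hypothetical callback returning one NCC score
    per sampled source view for hypothesis h."""
    def cost(h):
        scores = ncc_per_view(h)
        return sum(1.0 - s for s in scores) / len(scores)
    return min(hypotheses, key=cost)
```

In a real pipeline the callback would warp the reference patch into each sampled source view using the hypothesized depth and normal; here it is left abstract so that the selection logic stays visible.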
4 Textureness-Aware Joint PatchMatch and View Selection
The core ingredient that makes a Multi-View Stereo algorithm successful is the quality and the discriminative effectiveness of the stereo comparison among patches belonging to different cameras. Such a comparison relies on a photometric measure, computed as Normalized Cross Correlation (NCC) or similar metrics such as Sum of Squared Differences (SSD) or Bilateral Weighted NCC. The major drawback arises in untextured regions. Here the discriminative capabilities of NCC become unreliable because all the patches belonging to the untextured area are similar to each other.
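A small numerical sketch makes the ambiguity concrete. It uses SSD on toy 1D signals rather than NCC on image patches, and the signals and the helper `ssd_curve` are assumptions made for illustration: on a textured signal the matching cost has a single minimum at the correct shift, whereas on a constant signal every shift yields the same cost.

```python
def ssd_curve(reference, source, center, radius, shifts):
    """SSD between the reference patch around `center` and the source
    patches obtained by shifting the window by each value in `shifts`."""
    ref = reference[center - radius:center + radius + 1]
    costs = []
    for s in shifts:
        start = center + s - radius
        src = source[start:start + 2 * radius + 1]
        costs.append(sum((r - v) ** 2 for r, v in zip(ref, src)))
    return costs

textured = [3, 8, 1, 9, 4, 7, 2, 6, 0, 5, 8, 3, 9, 1, 7, 4]
flat = [5] * len(textured)
shifts = list(range(-3, 4))

# The source equals the reference, so the correct shift is 0 in both cases.
textured_costs = ssd_curve(textured, textured, center=8, radius=2, shifts=shifts)
flat_costs = ssd_curve(flat, flat, center=8, radius=2, shifts=shifts)

# Textured: only shift 0 has zero cost. Flat: all shifts are equally good.
print(textured_costs)
print(flat_costs)
```

With zero-mean normalized measures such as NCC the situation is no better, since a constant patch has zero variance and the correlation is dominated by noise.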
Under these assumptions, the idea behind our proposal is to segment images into superpixels such that each superpixel spans a region of the image with mostly homogeneous texture and likely ends at an image edge. Then, we propagate the depth/normal estimates belonging to photometrically stable regions around the edges to the entire superpixel. In the following we assume that the first iteration of the framework presented in Section 3 has been executed, so that we have a first estimate of the depth map, which is reliable only in highly textured regions (Figure 2).
4.1 Piecewise Planar Hypotheses generation
The idea of the method is to augment the set of PatchMatch depth hypotheses in Equation (4) with novel hypotheses that model a piecewise planar prior corresponding to untextured areas.
In the first step we extract the superpixels of each image by means of the SEEDS algorithm [24]. Since a superpixel generally contains homogeneous texture, we assume that all the pixels covered by a superpixel roughly belong to the same plane.
After running the first iteration of depth estimation, we filter out the small isolated speckles of the resulting depth map (in this paper, those with area smaller than a fixed threshold). As a consequence, the area of each superpixel in the filtered depth map likely contains a set of reliable 3D point estimates that roughly correspond to real 3D points. In the presence of untextured regions, these points mostly belong to the areas near edges (Figure 2).
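One common way to remove small isolated speckles is a connected-component filter, sketched below. The source does not specify how speckles are detected, so the 4-connectivity, the depth-similarity test `max_diff`, the use of `None` for invalid pixels, and the name `filter_speckles` are all assumptions; `min_area` stands for the area threshold mentioned in the text.

```python
from collections import deque

def filter_speckles(depth, min_area, max_diff):
    """Invalidate (set to None) connected regions smaller than `min_area`.
    Two 4-neighbours are connected when both are valid and their depths
    differ by at most `max_diff`. Returns a filtered copy of `depth`."""
    h, w = len(depth), len(depth[0])
    out = [row[:] for row in depth]
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if seen[y][x] or depth[y][x] is None:
                continue
            # Breadth-first flood fill of the region containing (y, x).
            region, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                region.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx]
                            and depth[ny][nx] is not None
                            and abs(depth[ny][nx] - depth[cy][cx]) <= max_diff):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Small regions are treated as speckles and discarded.
            if len(region) < min_area:
                for ry, rx in region:
                    out[ry][rx] = None
    return out
```

After such a filter, the surviving depths inside a superpixel are the candidates used for plane fitting.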
For each superpixel, we fit a plane to its reliable 3D points with RANSAC, classifying the points farther than 10 cm from the plane as outliers. For a pixel of the superpixel, we define the tentative depth hypothesis as the depth of the corresponding 3D point on the plane, and the tentative normal hypothesis as the plane normal (Figure 3). We then define the inlier ratio of the fit, whose value expresses the confidence of the plane estimate.
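The plane fitting step can be sketched with a plain RANSAC loop. This is an illustrative sketch, not the authors' code: it assumes that coordinates are expressed in meters (so the 10 cm threshold becomes `0.10`), that the camera center is the origin, and that the inlier ratio is the fraction of points within the threshold; the iteration count, the seed, and all function names are hypothetical.

```python
import random

def plane_from_points(p, q, r):
    """Plane through three points as (unit normal n, offset d) with
    n . x + d = 0, or None if the points are (nearly) collinear."""
    u = [q[i] - p[i] for i in range(3)]
    v = [r[i] - p[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    norm = sum(c * c for c in n) ** 0.5
    if norm < 1e-12:
        return None
    n = [c / norm for c in n]
    return n, -sum(n[i] * p[i] for i in range(3))

def ransac_plane(points, threshold=0.10, iterations=200, seed=0):
    """Return (normal, offset, inlier_ratio) of the best plane found."""
    rng = random.Random(seed)
    best = (None, None, 0.0)
    for _ in range(iterations):
        model = plane_from_points(*rng.sample(points, 3))
        if model is None:
            continue
        n, d = model
        inliers = sum(
            1 for x in points
            if abs(sum(n[i] * x[i] for i in range(3)) + d) <= threshold
        )
        ratio = inliers / len(points)
        if ratio > best[2]:
            best = (n, d, ratio)
    return best

def depth_along_ray(ray, normal, offset):
    """Distance along a viewing ray from the origin to the plane
    normal . x + offset = 0, or None if the ray is parallel to it."""
    denom = sum(normal[i] * ray[i] for i in range(3))
    if abs(denom) < 1e-12:
        return None
    return -offset / denom
```

`depth_along_ray` gives the tentative depth for a pixel whose viewing ray is `ray`, and the returned inlier ratio plays the role of the confidence of the plane estimate.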
The actual hypothesis for a pixel is generated as follows. To deal with fitting uncertainty, the plane hypothesis of the pixel's own superpixel is adopted only with a probability tied to the inlier ratio: a value is sampled from a uniform distribution and, if it passes this test, the hypothesis is the tentative depth and normal defined above. To propagate the hypotheses from superpixels with a good inlier ratio to neighbors with a bad one, if the test fails the hypothesis is instead sampled from a set of neighboring superpixels. Since we aim at spreading the depth hypotheses among superpixels with a similar appearance, we sample from this set proportionally to the Bhattacharyya distance among the RGB histograms of the pixel's superpixel and of the elements of the set.
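The sampling rule can be sketched as below, under several assumptions that are not confirmed by the text: the acceptance probability is taken to be the inlier ratio itself, neighbors are weighted by the Bhattacharyya coefficient (a similarity in [0, 1], so that similar-looking superpixels are favored), and each superpixel is a dictionary with the hypothetical keys `plane`, `inlier_ratio`, and `hist`.

```python
import math
import random

def bhattacharyya_coefficient(h1, h2):
    """Overlap of two histograms after normalising each to sum to one
    (1 means identical distributions, 0 means disjoint support)."""
    s1, s2 = sum(h1), sum(h2)
    return sum(math.sqrt((a / s1) * (b / s2)) for a, b in zip(h1, h2))

def sample_hypothesis(own, neighbours, rng):
    """Return the (depth, normal) plane hypothesis to test for a pixel.
    `own` is the pixel's superpixel, `neighbours` the adjacent ones."""
    # Trust the superpixel's own plane with probability = inlier ratio.
    if rng.random() <= own["inlier_ratio"]:
        return own["plane"]
    # Otherwise borrow a plane from a neighbour with similar appearance.
    weights = [bhattacharyya_coefficient(own["hist"], nb["hist"])
               for nb in neighbours]
    if not neighbours or sum(weights) == 0.0:
        return own["plane"]
    return rng.choices(neighbours, weights=weights, k=1)[0]["plane"]
```

Falling back to the superpixel's own plane when no similar neighbor exists is a design choice of this sketch; returning no hypothesis at all would be an equally plausible alternative.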
Experimentally, we noticed that the number of superpixels influences how the untextured areas are treated and modeled in our method. With a small number of superpixels, large areas of the image are nicely covered but, at the same time, limited untextured regions are improperly fused. Vice versa, a large number of superpixels better models small regions while underestimating large areas. For this reason, we adopt both a coarse and a fine superpixel segmentation of the image, such that both small and large untextured areas are modeled properly. Therefore, for each pixel, we generate two depth hypotheses, one from the coarse segmentation and one from the fine segmentation. In our experiments we choose and .
References
- [1] C. Barnes, E. Shechtman, A. Finkelstein, and D. Goldman. Patchmatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (TOG), 28(3):24, 2009.
- [2] M. Blaha, M. Rothermel, M. R. Oswald, T. Sattler, A. Richard, J. D. Wegner, M. Pollefeys, and K. Schindler. Semantically informed multiview surface refinement. International Journal of Computer Vision, 2017.
- [3] M. Blaha, C. Vogel, A. Richard, J. D. Wegner, T. Pock, and K. Schindler. Large-scale semantic 3d reconstruction: an adaptive multi-resolution model for multi-class volumetric labeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3176–3184, 2016.
- [4] M. Bleyer, C. Rhemann, and C. Rother. Patchmatch stereo-stereo matching with slanted support windows. In BMVC, volume 11, pages 1–11, 2011.
- [5] S. Galliani, K. Lasinger, and K. Schindler. Massively parallel multiview stereopsis by surface normal diffusion. The IEEE International Conference on Computer Vision (ICCV), June 2015.
- [6] P. Heise, S. Klose, B. Jensen, and A. Knoll. Pm-huber: Patchmatch with huber regularization for stereo matching. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 2360–2367. IEEE, 2013.
- [7] P.-H. Huang, K. Matzen, J. Kopf, N. Ahuja, and J.-B. Huang. Deepmvs: Learning multi-view stereopsis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2821–2830, 2018.
- [8] M. Jancosek and T. Pajdla. Multi-view reconstruction preserving weakly-supported surfaces. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3121–3128. IEEE, 2011.
