Recovering 3D Planar Arrangements from Videos

Shuai Du; Youyi Zheng

arXiv:1701.07393·cs.CV·January 26, 2017

TL;DR

This paper introduces a novel optimization framework for reconstructing 3D planar arrangements from videos by leveraging structure-guided dynamic tracking and planar constraints, improving accuracy in camera motion and structure estimation.

Contribution

It presents a new structure-guided dynamic tracking algorithm and a reconstruction pipeline that enforces planar constraints, enhancing 3D reconstruction from videos.

Findings

1. Effective localization of structure correspondence across dense frames

2. Faithful reconstruction of camera motion and 3D structure

3. Improved robustness over traditional point correspondence methods

Tables

Table 1: Statistics of the exemplar scenes used in our paper (Figures 1 and 9). The fourth column records the number of points to track and, in parentheses, the number of user adjustments performed during tracking. The adjustments mainly come from points in occluded and fuzzy regions, where continuous adjustment is required throughout the sequence.

scene	frames	planes	points (adj.)	time (s)
boxes	300	11	23 (70)	380
hall	350	12	38 (100)	420
toy house	300	18	35 (40)	270
desktop	150	14	36 (5)	150
library	500	18	41 (50)	330

Equations

$$X_{w} = R_{wc} X_{c} + T_{wc}, \tag{1}$$

$$x = f \frac{X}{Z}, \qquad y = f \frac{Y}{Z}. \tag{2}$$

$$x^{\prime} = K [R \mid t] X^{\prime} \tag{3}$$

$$K^{-1} x_{1} = R K^{-1} x_{2} + T. \tag{4}$$

$$\min \sum_{i} \sum_{j} \left( x_{i}^{j} - K [R_{i} \mid t_{i}] X^{j} \right)^{2}, \tag{5}$$

$$e_{ij}^{i} \cdot N_{i} = 0, \tag{6}$$

$$\begin{aligned} N_{a} \cdot N_{b} &= 0; \\ N_{i} \cdot N_{j} &= 1; \\ e_{ij} \cdot (N_{i} + N_{j}) &= 0. \end{aligned} \tag{7}$$

Taxonomy

Topics: Advanced Vision and Imaging · Robotics and Sensor-Based Localization · Optical measurement and interference techniques

Full text

Recovering 3D Planar Arrangements from Videos

Shuai Du

ShanghaiTech University

Youyi Zheng

ShanghaiTech University

Abstract

Acquiring the 3D geometry of real-world objects has various applications in 3D digitization, such as navigation and content generation in virtual environments. Images remain one of the most popular media for such visual tasks because they are simple to acquire. Traditional image-based 3D reconstruction approaches rely heavily on point-to-point correspondences among multiple images to estimate camera motion and 3D geometry. Establishing point-to-point correspondence lies at the center of the 3D reconstruction pipeline, yet it is prone to errors. In this paper, we propose an optimization framework that traces image points using a novel structure-guided dynamic tracking algorithm and estimates both the camera motion and a 3D structure model by enforcing a set of planar constraints. The key to our method is a structure model represented as a set of planes and their arrangements. Constraints derived from the structure model are used in both the correspondence establishment stage and the bundle adjustment stage of our reconstruction pipeline. Experiments show that our algorithm can effectively localize structure correspondence across dense image frames while faithfully reconstructing the camera motion and the underlying structured 3D model.

1 Introduction

As virtual reality devices become more and more popular, 3D content plays a key role in their use. Compared with manually building 3D models in software such as 3ds Max, Maya, and Blender, automated 3D reconstruction methods are attracting more attention because of their efficiency. A large body of research has been devoted to 3D reconstruction with the aim of producing realistic 3D geometry [6, 7, 12, 22, 25]. However, most of these methods output a low-level geometric point cloud. Such low-level geometry often lacks the structural and semantic properties of the 3D content, which hinders its direct use in subsequent applications.

Recently, techniques have emerged that exploit high-level structural information such as the Manhattan-world assumption [5], CSG representations [32], and repetitions [3] to aid 3D reconstruction. What these approaches have in common is the observation that such structural information is ubiquitous in manmade scenes and can therefore be exploited either at the early stage of analysis [3] or at the later stage of consolidation [5].

In this paper, we propose a semi-automatic method to recover a structured 3D model from video sequences. We focus on reconstructing structured 3D models from videos captured with inexpensive consumer-level RGB cameras. We base our experiments on the speculation that the underlying structure of a 3D object is essentially hidden in the geometry and can be globally decoupled from the geometry [20]. This could give us not only a more stable bundle estimation but also a structured 3D model. Our key observation is that most manmade environments, such as houses and indoor scenes, consist of many planar surfaces, which inspires us to represent the structured 3D model as a set of planes and their arrangements (e.g., parallel planes and intersecting planes). We devise an optimization framework that simultaneously recovers the plane arrangements and the camera motion as well as the intrinsic relations among the planes.

Starting with a video sequence of a 3D model captured with a hand-held consumer camera or downloaded from the internet, we let the user initiate a few planes in a single frame by specifying a sequence of points with mouse clicks (Figure 3). We detect planar regions and represent planes as open or closed polygons, depending on point visibility. Our automatic tracking algorithm is then employed to track each polygon point in the remaining frames. The tracking consists of a structure-based optical flow process and a backtracking process via dynamic programming. We then use triangulation to calculate the motion of the camera, and finally we use a structure-constrained bundle adjustment algorithm to optimize the plane points, taking the inter-relations of the planes into consideration.

We test our algorithm on various manmade scene data. By manually providing a sparse set of input (points) in the initial frame, we are able to track the structured points in dense image frames using our tracking algorithm. After the tracking procedure, our method automatically extracts structural relations among planes and uses the arrangement relations as constraints in the bundle adjustment. Constraints that fail to be detected by our algorithm can be further added by the user to provide additional cues to help reconstruct the 3D object. Finally, our algorithm can reconstruct an object satisfying all the constraints.

Our contribution contains two parts. The first part is a structure propagation method. After the user has initially marked out the structure model, we propagate the structure model using a dynamic tracking algorithm, which can relieve the user from marking too many things in order to establish correspondence across the frames. The second part is that we come up with a structure constrained bundle adjustment optimization process, wherein the structural constraints are gradually consolidated during optimization.

We organize this paper as follows. Section 2 discusses related work on reconstruction and tracking. In Sections 3 and 4, we present the propagation step for dense correspondence matches, the reconstruction algorithm, and our modified bundle adjustment method. Results are presented in Section 5. We conclude with a discussion of our method in Section 6.

2 Related Work

A full review of current state-of-the-art 3D reconstruction algorithms is out of the scope of this paper. We refer interested readers to the excellent surveys on stereo vision [23] and multi-view geometry [9, 19]. Below we review the works that are closely related to ours on structure-based tracking and reconstruction.

Point tracking. To track corresponding points between frames, Hare et al. [8] combine matching and tracking in a unified optimization formulation. They use their method to detect and track objects under a large class of 3D pose or homography transformations. We tried a similar version of the method to track the plane corner points using homography warping. However, the inherent noise in corner point detection makes the subsequent RANSAC computation of the homography matrix $H$ unreliable, and the tracking results proved impractical.

Another common way to track for corresponding points is to use optical flow. It is a dense field of displacement vectors which defines the translation of each pixel in a region. Popular techniques for computing dense optical flow include methods by Horn and Schunck [10], Lucas et al. [18], and Weinzaepfel et al. [29].

More recent work includes [28, 15, 2, 1]. Wedel et al. [28] explore fundamental matrix priors which favor flows that are aligned with epipolar lines. Lempitsky et al. [15] assume that a number of candidate flow fields have been generated by running standard algorithms, possibly multiple times with different parameters. Computing the flow is then posed as choosing which of the candidates is best at each pixel. Other methods, such as Brox et al. [2] and Bailer et al. [1], first perform a coarse feature matching for large-displacement optical flow to refine the result.

To the best of our knowledge, none of the above methods explicitly exploits a structure model to help the tracking process. Our approach utilizes the underlying planar structure to alleviate the instabilities of single-point tracking and thus enables more reliable frame-to-frame tracking.

Structure-based reconstruction. Many structure-based modeling approaches assume there is a structure. These structures include Manhattan-world assumption [5], cuboid assumption [13], CSG representation [32], symmetry [17, 35, 14] and repetitions [3], etc., which are exploited to help regulate and reconstruct the 3D object and to truly interpret the scene.

Given known constraints under perspective projection, 3D information can be recovered from a single image [36, 14, 26] by calculating the normals. However, a single image has a very limited field of view and cannot deal with occlusion without an additional symmetry assumption [14, 35].

Mura et al. [21] use clustered 3D range scans to create structured 3D models of typical interior environments by recognizing their structure of individual rooms and corridors.

By learning the unique features of different types of surfaces and the contextual relationships between them, Xiong et al. [33] propose a method to automatically convert the 3D point data from a laser scanner into a compact, semantically rich information model. And from panorama RGBD images, Furukawa et al. [11] use a graph to represent the internal structures and reconstruct an indoor scene as a structured model.

Relying on the raw output of traditional multi-view stereo techniques, a structured model can be created and regularized with structural constraints discovered from the point cloud [34, 27]. Such methods can fail once the multi-view stereo methods return degenerate output due to occlusion, reflectance, poor illumination, etc.

In contrast, our method couples the processes of structure discovery and structure regularization and jointly optimizes the plane geometry and plane arrangements, at the cost of a lightweight initial input of polygon points.

3 The approach

We now detail our algorithm. The main pipeline is divided into two key stages: a structure-based point tracking stage that establishes point correspondence across frames, and a joint optimization stage in which camera motion and planar structures are recovered simultaneously.

3.1 Initialization of the structure model

The input to our algorithm is a video sequence of a 3D model or 3D scene captured by a hand-held camera or downloaded from the internet. As mentioned before, our goal is to reconstruct the structured 3D model in terms of a set of planes and their arrangements. We represent each plane by a planar polygon.

During initialization, we let the user create these polygons manually, since automatic detection of planar regions in images is an ill-posed problem and is easily corrupted by occlusions in cluttered scenes. To create a planar face, the user clicks on the image to indicate a corner point of the face and then moves the mouse to place the next corner point. Two corner points form a line segment of a planar face. While the mouse moves, the user sees a highlighted line segment connecting the previous point to the current mouse position (Figure 3). In this setting, the user can position the points more precisely by aligning the line segment with image edges. The user continues from existing corner points to create additional points and line segments of planar faces, so newly created line segments share vertices with existing ones. This process yields a graph with points as nodes and line segments as edges, from which we extract planar faces using the automatic planar region detection algorithm of [17]. Figure 3 (left) shows a snapshot of the interaction process. Since the polygon points are all marked by the user, they may be inaccurate. However, this does not affect our point tracking algorithm, because our structure-based tracking iteratively improves the point positions based on the detected structures by minimizing a structure error. This also eases the user's work in marking the structures at initialization.

Once the user has drawn the points and lines that form the structured model to be reconstructed, our system automatically tracks these face points using a structure-guided dynamic tracking algorithm. The user can add points that do not appear in the initial frame because of occlusion (yellow points and edges in Figure 2); our system then automatically tracks the newly added points in the subsequent frames. Occasionally, the user can help the tracking procedure by adjusting the tracked position of some points when occlusion happens or edges get blurred, and the system updates the intermediate tracking results using dynamic programming. Finally, we obtain all the corresponding points, which are fed into a structure-constrained bundle adjustment algorithm.

3.2 Structure-guided Point Tracking

To trace the corner points of all planar faces in the marked image, a straightforward approach is to employ optical flow [18, 29]. Unfortunately, direct tracking with optical flow gives poor results, as it is based on local gradients and pays no attention to the global structure. See Figure 4 for an illustration of results from direct tracing with optical flow.

We resort to an algorithm that exploits the structure information provided in the initial frame. Instead of directly tracing the points in a local window using optical flow, we make the following key observation: a corner point is the intersection of two or more adjacent line segments of the polygonal faces. While tracking a single point might lead to undesired positions, tracking a line segment (a set of points) can be more reliable. To this end, we uniformly sample points along each of the line segments that intersect at a corner point of the planar faces and track the sampled points from the first frame to the next by optical flow. For each line segment $s_{i}=\{p_{1}^{i},p_{2}^{i},...,p_{n}^{i}\}$, where $p_{k}^{i}$ is the $k$-th sampled point on $s_{i}$, each sampled point $p_{k}^{i}$ obtains a new position ${p_{k}^{i}}^{\prime}$. We weight the points with a propagation confidence value computed from the optical flow. We then run a weighted RANSAC algorithm to find the best-fitting line for the new point positions. Intersection points are updated accordingly, which completes the tracking of the corner points. When more than two lines intersect, we find the intersection point by weighted least squares. Figure 4 shows an example of the tracing process. Compared to single-point local tracing, our method generates much more reliable results.
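The line-based corner update described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the tracked sample points and their confidence weights are already available, it scores a RANSAC hypothesis by the summed weight of its inliers (one plausible reading of "weighted RANSAC"), and the names `weighted_ransac_line`, `intersect`, `threshold`, and `iterations` are hypothetical.

```python
import math
import random

def line_through(p, q):
    """Return (a, b, c) with a*x + b*y + c = 0 and a^2 + b^2 = 1, or None if p == q."""
    a, b = q[1] - p[1], p[0] - q[0]
    norm = math.hypot(a, b)
    if norm == 0.0:
        return None
    a, b = a / norm, b / norm
    return a, b, -(a * p[0] + b * p[1])

def weighted_ransac_line(points, weights, threshold=1.0, iterations=200, seed=0):
    """Keep the two-point line hypothesis whose inliers have the largest summed weight."""
    rng = random.Random(seed)
    best_line, best_score = None, -1.0
    for _ in range(iterations):
        p, q = rng.sample(points, 2)
        line = line_through(p, q)
        if line is None:
            continue
        a, b, c = line
        score = sum(w for (x, y), w in zip(points, weights)
                    if abs(a * x + b * y + c) <= threshold)
        if score > best_score:
            best_line, best_score = line, score
    return best_line

def intersect(lines, weights=None):
    """Weighted least-squares point minimizing sum_k w_k (a_k x + b_k y + c_k)^2."""
    weights = weights or [1.0] * len(lines)
    sxx = sum(w * a * a for (a, b, c), w in zip(lines, weights))
    sxy = sum(w * a * b for (a, b, c), w in zip(lines, weights))
    syy = sum(w * b * b for (a, b, c), w in zip(lines, weights))
    bx = -sum(w * a * c for (a, b, c), w in zip(lines, weights))
    by = -sum(w * b * c for (a, b, c), w in zip(lines, weights))
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("lines are (nearly) parallel")
    return (bx * syy - sxy * by) / det, (sxx * by - sxy * bx) / det
```

For two non-parallel lines the least-squares solution is their exact intersection; with three or more lines it is the point with the smallest weighted sum of squared distances to the lines.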

Sometimes the result of the structured optical flow drifts from the real position, which can be caused by a fuzzy point marked by the user. To alleviate this problem, we create a local $3\times{3}$ window $w(c_{i})$ for each tracking point. We consider each pixel in the window as a candidate point and trace all candidates using the structure-guided propagation. Specifically, each point $p_{j}$ in the local window is connected to the neighboring corner points of $c_{i}$, which creates new line segments. We then trace these line segments for point $p_{j}$ using our structure-guided propagation. We choose the most confident traced point (measured as the summed weights returned from the optical flow) as the new location of the tracking point $c_{i}$. See Figure 5 for an illustration of the process; each blue line is a candidate edge in the next frame.

Occasionally, the tracing can still fail if occlusion happens or the object edge blends with the background colors as the camera view changes, as shown in Figure 6. We design a back-tracing algorithm to address this issue. Once tracking fails in one frame, errors accumulate in all subsequent frames, and the failure cannot be undone automatically because we do not know at which frame the tracking went wrong. Hence, we leave this to the user. If at any stage the user observes that point tracking has gone wrong, the user can adjust the point position by dragging the corner point with the mouse. We devise a dynamic programming algorithm to automatically adjust the point positions in all the intermediate frames, detailed below.

Given the start point and the end point in two frames $i$ and $j$, we would like to find the best path that connects these two points and goes through points in the intermediate image frames. To increase the chance of finding the optimal path, as before, we start from a corner point (pixel) $c_{i}$ in frame $i$ and create a local $3\times{3}$ window $w(c_{i})$ centered at that point. For each point in the window, we trace its path along the subsequent frames using the same strategy as above (i.e., by tracing the line segments). Each point in window $w(c_{i})$ at frame $i$ is thus traced to a point in frame $i+1$. In frame $i+1$, we then create a local $3\times{3}$ window for each traced point and repeat the process for frame $i+2$, until we reach frame $j$.

Note that the above process creates a discrete set of local windows across all frames between $i$ and $j$ and thus guarantees the existence of a valid tracing path. However, this strategy quickly leads to exponential complexity, as the size of the local windows grows exponentially. For efficiency, we need to bound the search locally so that the window size does not grow too quickly.

We devise a constrained window-growing algorithm. We observe that the search region should be small when the frame is close to frame $i$ and larger when it is far from frame $i$. This is not surprising given the nature of camera motion in consecutive frames. Hence, for each intermediate frame $k$, we restrict the local window size to $\min\{2(k-i)+3,15\}$, $k=i,...,j$, so that the window starts at $3\times{3}$ in frame $i$ and grows with $k$ up to a fixed bound. The center of the local window at frame $k$ (except for frames $i$ and $j$) is the weighted center of all traced points from frame $k-1$ (weights are computed from the optical flow). Note that this crops out traced points that are far from the center. Figure 7 shows the tracing windows; note that the positions of the window centers vary across frames.

We then establish links across all intermediate frames. Each pixel in the local window at frame $k$ is connected by edges to all pixels in the local windows of frames $k-1$ and $k+1$. The weight of each edge is the key to our path-finding algorithm, and we relate it to the results of structure tracking. Let $p_{i}$ denote the point to trace in frame $i$ and $p_{i+1}$ its traced point in the next frame. The lowest weight is assigned to the link $p_{i}\rightarrow{p_{i+1}}$, and the weight increases as $p_{i}$ is connected to points in the local window of frame $i+1$ that lie farther from $p_{i+1}$. We use $(p_{i+1}-p_{i})^{2}/f_{i}$ as our weight function, where $f_{i}$ is a score returned by the optical flow algorithm. The best tracing path is the shortest path connecting the point in frame $i$ to the point in frame $j$, found by dynamic programming.
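The layered shortest-path search can be illustrated with a short dynamic program. The sketch below is an assumption-laden reading of the text, not the authors' code: `layers[k]` holds the candidate pixels of frame `k`, `traced[k][a]` holds the optical-flow prediction and score for candidate `a` (both would come from the structure-guided tracker), and the edge weight is taken to be the squared distance between a candidate and that prediction, divided by the flow score.

```python
def best_path(layers, traced):
    """Minimum-cost path with one candidate per frame.

    layers[k]    : list of candidate pixels (x, y) in frame k
    traced[k][a] : ((x, y) predicted in frame k+1, flow score > 0) for layers[k][a]
    """
    inf = float("inf")
    cost = [0.0] * len(layers[0])
    back = []
    for k in range(len(layers) - 1):
        next_cost = [inf] * len(layers[k + 1])
        next_back = [-1] * len(layers[k + 1])
        for a in range(len(layers[k])):
            (px, py), score = traced[k][a]
            for b, (qx, qy) in enumerate(layers[k + 1]):
                # Cheapest when the candidate coincides with the traced position.
                w = ((qx - px) ** 2 + (qy - py) ** 2) / score
                if cost[a] + w < next_cost[b]:
                    next_cost[b] = cost[a] + w
                    next_back[b] = a
        cost = next_cost
        back.append(next_back)
    # Backtrack from the cheapest candidate in the last frame.
    b = min(range(len(cost)), key=cost.__getitem__)
    path = [b]
    for next_back in reversed(back):
        b = next_back[b]
        path.append(b)
    path.reverse()
    return [layers[k][idx] for k, idx in enumerate(path)]
```

Fixing the user-specified start and end points amounts to giving the first and last layers a single candidate each. Because edges only connect consecutive layers, the work is proportional to the number of frames times the square of the number of candidates per frame.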

4 Reconstruction and Modified Bundle Adjustment

After tracking, we have the whole sequence of images with corresponding points in them. Following the traditional structure-from-motion pipeline would give us a set of 3D points as well as the rigid camera transformations. However, this would completely ignore the planar structures of the model; for example, points of a planar face might no longer lie on the same plane. Hence, we integrate such constraints into our bundle adjustment algorithm. In addition, structure relations such as coplanarity and orthogonality should be added as well. Detecting such relations from 2D images alone is an ill-posed problem due to the lack of 3D information. We thus resort to an iterative optimization approach that analyzes such relations in 3D and then feeds them back into the bundle adjustment.

4.1 Image Formation and Camera Motion

For completeness of exposition, we briefly introduce the camera model we use and give a basic description of the structure-from-motion pipeline. Interested readers are referred to [9] for a more thorough treatment.

Assume the camera coordinate of frame [math] is the world coordinate $\Gamma_{w}=[r_{1w},r_{2w},r_{3w}]$ and the camera coordinate at frame $c$ is $\Gamma_{c}=[r_{1c},r_{2c},r_{3c}]$ . Given any 3-D point $X$ in the world coordinate, we have $X=\Gamma_{w}X_{w}=\Gamma_{c}X_{c}$ where $X_{w}$ and $X_{c}$ are the local coordinates of $X$ in $\Gamma_{w}$ and $\Gamma_{c}$ respectively (assume $\Gamma_{w}$ and $\Gamma_{c}$ share the same origin, otherwise there will be a translation $T_{wc}$ ). So we have $X_{w}=R_{wc}X_{c}$ , where $R_{wc}=\Gamma_{w}^{-1}\Gamma_{c}$ is a rotation matrix. In general, the coordinates of a 3-D point according to two arbitrary coordinate bases have the following relation:

$$X_{w} = R_{wc} X_{c} + T_{wc}, \tag{1}$$

where $T_{wc}$ is the translation between the two corresponding coordinate bases.

We know an image is captured through a camera lens. When the aperture is small, the camera model can be regarded as a pinhole camera. In this case, the point x = $[x,y]^{T}$ on the image is given by the following equations:

$$x = f \frac{X}{Z}, \qquad y = f \frac{Y}{Z}. \tag{2}$$

Here $f$ is the distance between the CCD and the aperture, and $[X,Y,Z]^{T}$ is the 3D point corresponding to the 2D image point. The relation follows directly from similar triangles.

Usually, the image has $(0,0)$ at its upper-left corner. This requires a shift along both the $x$ and $y$ axes when the focal point is assumed to lie at the image center. Combining all of the above, we obtain the relation between the camera's image plane and the world's 3D coordinates:

$$x^{\prime} = K [R \mid t] X^{\prime} \tag{3}$$

where $X^{\prime}=[X,Y,Z,1]^{T}$ and $x^{\prime}=[x,y,1]^{T}$ are now homogeneous coordinates, $K=\begin{bmatrix}f&0&u\\ 0&f&v\\ 0&0&1\end{bmatrix}$ is the intrinsic parameter matrix of the camera, $R=R_{cw}$ and $t=T_{cw}$ represent the camera motion relative to the world coordinate system, and $X^{\prime}$ is the 3D position in the world coordinate system.
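As a small worked illustration of the projection relation above, the following sketch maps a world point to pixel coordinates. It is a plain-Python example under the stated pinhole model; the function name `project` and the numeric values are illustrative assumptions only.

```python
def project(K, R, t, X):
    """Project a 3D world point X to pixel coordinates using x' = K [R | t] X'."""
    # Camera-frame coordinates: R X + t.
    Xc = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
    # Homogeneous image coordinates: K Xc.
    h = [sum(K[r][c] * Xc[c] for c in range(3)) for r in range(3)]
    # Dehomogenize: the relation holds only up to scale.
    return h[0] / h[2], h[1] / h[2]

# Example: focal length 500 px, principal point (320, 240), camera at the world origin.
K = [[500, 0, 320], [0, 500, 240], [0, 0, 1]]
R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
t = [0, 0, 0]
pixel = project(K, R, t, [1, 2, 10])
```

With these values the point at depth 10 lands at $x = 500 \cdot 1/10 + 320$ and $y = 500 \cdot 2/10 + 240$, which matches the pinhole equations plus the principal-point shift.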

4.2 Sfm and Optimization

The relation between two corresponding points $x_{1}$ and $x_{2}$ in two images can be derived from Equation (3):

$$K^{-1} x_{1} = R K^{-1} x_{2} + T. \tag{4}$$

Each pair $R$, $T$ can be estimated with the eight-point algorithm [9, 19], which yields the relative motion between two views. We then need to merge all relative $R$, $T$ into one world coordinate system. Suppose the relation between the $1^{st}$ and $2^{nd}$ cameras is $[R_{12}|T_{12}]$ and the relation between the $2^{nd}$ and $3^{rd}$ cameras is $[R_{23}|T_{23}]$. Applying $[R_{12}|T_{12}]$ to $[R_{23}|T_{23}]$ gives the relation between the $1^{st}$ and $3^{rd}$ cameras, $[R_{12}R_{23}|R_{12}T_{23}+T_{12}]$.
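Chaining relative poses can be sketched in a few lines. The example assumes the convention $X_{1}=R_{12}X_{2}+T_{12}$ (the same form as the two-view relation above), so substituting $X_{2}=R_{23}X_{3}+T_{23}$ gives $X_{1}=R_{12}R_{23}X_{3}+R_{12}T_{23}+T_{12}$; the function name `compose` is hypothetical.

```python
def compose(R12, T12, R23, T23):
    """Chain X1 = R12 X2 + T12 and X2 = R23 X3 + T23 into X1 = R13 X3 + T13."""
    R13 = [[sum(R12[r][k] * R23[k][c] for k in range(3)) for c in range(3)]
           for r in range(3)]
    T13 = [sum(R12[r][k] * T23[k] for k in range(3)) + T12[r] for r in range(3)]
    return R13, T13

# Example: a 90-degree rotation about z followed by an identity rotation.
R12 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
T12 = [1, 0, 0]
R23 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
T23 = [0, 2, 0]
R13, T13 = compose(R12, T12, R23, T23)
```

Repeating this composition along the sequence expresses every camera relative to the first one, which can then serve as the world coordinate system.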

Finally, we run the optimization to reduce the error. We consider both the re-projection error and the structure error during the optimization. The re-projection error is formulated as:

$$\min \sum_{i} \sum_{j} \left( x_{i}^{j} - K [R_{i} \mid t_{i}] X^{j} \right)^{2}, \tag{5}$$

where $x_{i}^{j}$ is the 2D image point of the 3D point $X^{j}$ in image $i$.
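The re-projection term can be evaluated with a short routine such as the one below. It is only a sketch: the dictionary `observations`, keyed by (image index, point index), is an assumed data layout that lets a point be missing from some images, and `project` and `reprojection_error` are hypothetical helper names.

```python
def project(K, R, t, X):
    """Pixel coordinates of world point X under x' = K [R | t] X'."""
    Xc = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
    h = [sum(K[r][c] * Xc[c] for c in range(3)) for r in range(3)]
    return h[0] / h[2], h[1] / h[2]

def reprojection_error(K, cameras, points, observations):
    """Sum of squared pixel residuals over all observed (image i, point j) pairs.

    cameras[i]          : (R_i, t_i)
    points[j]           : 3D point X^j
    observations[(i,j)] : observed pixel (x, y) of point j in image i
    """
    total = 0.0
    for (i, j), (u, v) in observations.items():
        R, t = cameras[i]
        pu, pv = project(K, R, t, points[j])
        total += (u - pu) ** 2 + (v - pv) ** 2
    return total

K = [[500, 0, 320], [0, 500, 240], [0, 0, 1]]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
cameras = [(identity, [0, 0, 0])]
points = [[1, 2, 10]]
error = reprojection_error(K, cameras, points, {(0, 0): (371.0, 340.0)})
```

In the example the observation is one pixel off in $x$, so the summed squared error is 1. A bundle adjustment would minimize this quantity over all camera poses and 3D points.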

Besides the re-projection error, we add a term that measures the structure error: points on the same planar face should stay coplanar in 3D. This leads to the constraint:

$$e_{ij}^{i} \cdot N_{i} = 0, \tag{6}$$

where $e_{ij}^{i}$ is an edge on plane $i$ and $N_{i}$ is the normal direction of plane $i$, which can be computed from the edges. Optimizing the above equations gives a bundle adjustment of the camera motion and the structured planar faces. Still, other structure relations are missing, such as coplanarity and orthogonality between two planar faces.

A direct analysis of such relations in 2D is not feasible, so we detect coplanarity and orthogonality in the 3D estimate produced by the above bundle adjustment. We employ a method similar to GlobFit [16]. We detect near-orthogonal, near-coplanar, and near-parallel plane groups and attempt to enforce them to be exactly orthogonal, coplanar, and parallel. A group of planar faces is detected as parallel if their normals coincide (within $10^{\circ}$). A group of planar faces is detected as coplanar if their normals coincide and the line connecting their centers is close to orthogonal to their normals. Two parallel groups of planar faces are detected as orthogonal if the angle between their normals is close to $90^{\circ}$ ($\pm 10^{\circ}$). To reduce ambiguities in detecting these constraints, we first detect parallel groups of planes and take their weighted normal for the subsequent analysis of orthogonality and coplanarity. To detect parallel groups, we apply mean shift [4] to the plane normals with a default bandwidth of $10^{-3}$. During the optimization, if enforcing any group constraint increases the error in the bundle adjustment, we release that group constraint. If the automatic detection fails, we let the user indicate planar relations by clicking on the relevant faces. The constraints of orthogonality, parallelism, and coplanarity take the following forms:
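The detection rules above can be illustrated with a simplified pairwise test. This sketch deliberately replaces the mean-shift grouping with a direct comparison of two planes, ignores the sign of the normals, and uses the $10^{\circ}$ tolerance from the text; the `(normal, center)` plane representation and the names `angle_deg` and `classify` are assumptions made for illustration.

```python
import math

def angle_deg(u, v):
    """Angle in [0, 90] degrees between two directions, ignoring their sign."""
    dot = abs(sum(a * b for a, b in zip(u, v)))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(min(1.0, dot / norm)))

def classify(plane_a, plane_b, tol_deg=10.0):
    """Each plane is (normal, center). Returns 'coplanar', 'parallel', 'orthogonal' or 'none'."""
    (na, ca), (nb, cb) = plane_a, plane_b
    between = angle_deg(na, nb)
    if between <= tol_deg:
        link = [q - p for p, q in zip(ca, cb)]  # line connecting the two centers
        # Coplanar when the connecting line is close to orthogonal to the normal.
        if not any(link) or angle_deg(link, na) >= 90.0 - tol_deg:
            return "coplanar"
        return "parallel"
    if between >= 90.0 - tol_deg:
        return "orthogonal"
    return "none"

floor = ((0.0, 0.0, 1.0), (0.0, 0.0, 0.0))
floor_patch = ((0.0, 0.0, 1.0), (5.0, 0.0, 0.0))
table_top = ((0.0, 0.0, 1.0), (1.0, 1.0, 1.0))
wall = ((1.0, 0.0, 0.0), (0.0, 0.0, 0.0))
ramp = ((1.0, 0.0, 1.0), (0.0, 0.0, 0.0))
```

A full implementation would first cluster the normals into parallel groups, as the text describes, and run the orthogonality and coplanarity tests on the group normals rather than on individual pairs.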

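$$\begin{aligned} N_{a} \cdot N_{b} &= 0; \\ N_{i} \cdot N_{j} &= 1; \\ e_{ij} \cdot (N_{i} + N_{j}) &= 0. \end{aligned} \tag{7}$$

Here the $e_{ij}$ are edges connecting points in the two planes. We use the Levenberg-Marquardt algorithm to optimize the sparse camera motion.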
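The structural constraints can be written as residual functions that are zero when a constraint holds. The sketch below shows only the residuals, assuming unit normals and 3D points given as plain lists; how the authors weight and stack them with the re-projection term inside the Levenberg-Marquardt solver is not specified in the text, and all function names are hypothetical.

```python
import math

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def unit(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def polygon_normal(points):
    """Unit normal from the first three polygon points (assumed not collinear)."""
    return unit(cross(sub(points[1], points[0]), sub(points[2], points[0])))

def planarity_residuals(points, normal):
    """e . N for every polygon edge e; all zero when the face is planar."""
    n = len(points)
    return [dot(sub(points[(k + 1) % n], points[k]), normal) for k in range(n)]

def orthogonality_residual(na, nb):
    return dot(na, nb)

def parallelism_residual(ni, nj):
    return dot(ni, nj) - 1.0

def coplanarity_residual(pi, pj, ni, nj):
    """e_ij . (N_i + N_j), with e_ij the edge from a point on plane i to one on plane j."""
    return dot(sub(pj, pi), [a + b for a, b in zip(ni, nj)])

square = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
bent = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.5], [0.0, 1.0, 0.0]]
up = polygon_normal(square)
```

In a least-squares solver these residuals would be appended, with suitable weights, to the re-projection residuals so that both kinds of error are reduced together.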

5 Experiments

In this section, we use real data to evaluate our algorithm. Figure 9 shows the example 3D models and scenes used in our experiments. They include stacked boxes in a dorm, a desktop workspace in a computer lab, a corridor inside a building, a toy house, and a school library. They are captured either by a cell phone or by a UAV (the library scene). These models span a typical set of objects in manmade environments, and many of them contain numerous planar structures, which suits our algorithm well.

Real data raises additional issues. The first is that the camera's intrinsic parameters are unknown. We address this by assuming that only the focal length of the camera is missing; we try different values to unproject the 2D image and choose the one with the least error. The second is that real data always contains blur caused by unsteady camera motion and limited resolution, which has a negative effect on our tracking process even when the tracking is assisted by the user. Providing structure constraints therefore helps improve the optimization result.

Figure 10 shows the reconstruction results, and Table 1 shows the statistics of the generated results. We render the reconstructed results as both shaded and textured 3D models for clear illustration. To texture a planar face, we use a technique similar to that of [24]. Our method allows the user to manually adjust the dynamic tracking process once failures are detected (see the point adjustment statistics in Table 1). We observe that such cases typically happen where occlusion occurs (e.g., the occluded points in the toy house model) or where the structure lines to track blend with the background (Figure 6 and Figure 9, bottom left). We believe these two cases are inherently challenging even for human perception, and we leave them as future work.

Time complexity. Our algorithm consists of a tracking part and a bundle optimization part. The tracking employs a shortest-path dynamic program whose time complexity is $O(kN^{2})$, with $N$ the largest window size and $k$ the number of frames (note that our graph is layered, so the complexity differs from that of a traditional all-pairs shortest-path algorithm on a graph $G$, which is known to have an approximate complexity of $O(|V(G)|^{3})$). The bundle adjustment optimization runs at the same rate as traditional SfM methods, which is fast in our case because our input is a sparse set of plane points. It takes less than 1 minute to optimize the toy house model on a laptop with a 3.2GHz CPU and 8GB RAM.

Comparison. We conducted a pilot comparison with two state-of-the-art 3D reconstruction methods that we consider relevant. The first is the planar reconstruction method proposed by Li et al. [17], and the other is the well-known structure-from-motion software VisualSFM proposed by Wu et al. [30, 31]. Figure 11 shows their results. Without any structural constraints, the VisualSFM system merely generated an incomplete point cloud, while the loose symmetry and coplanarity constraints used in the method of [17] still cannot guarantee convergence to the desired output, especially when dealing with only a single image (see the drifted faces). Our method faithfully recovers all planar faces and their inter-relations.

Limitation. Currently, our method requires the user to specify an initial set of corner points and edges that constitute the planar faces. The initial specification typically takes around 2-5 minutes. This is the main limitation of our algorithm. So far, we are not aware of any automatic algorithm that can robustly identify planar regions in RGB images. An intriguing direction to explore is to use deep-learning-based approaches for detecting planar regions. Another limitation is that our structure-based dynamic tracking can fail where edges become weak or occlusion happens. This is unavoidable due to the inherent noise and motion blur in video capture. A pre-denoising or deblurring step could alleviate the problem somewhat, but completely solving it requires more significant effort, as it needs a semantic understanding of the underlying scene.

6 Conclusion

In conclusion, this paper presents a semi-automatic 3D reconstruction algorithm that recovers a set of structured planes along with their arrangements from a video. Our key contribution is a structure model represented as a set of planes whose arrangements form a faithful description of the scene model. We propose a dynamic point tracking algorithm that explicitly exploits structure lines as an effective means of identifying reliable corner point locations. In addition, a structure-augmented optimization framework with bundle adjustment is introduced to jointly optimize the plane arrangements and the plane geometry. In future work, we will consider combining the traditional SfM process with automatic structure analysis to enable a fully automated 3D reconstruction pipeline, which we believe will open up new possibilities in structure-based 3D reconstruction.

Bibliography

[1] C. Bailer, B. Taetz, and D. Stricker. Flow fields: Dense correspondence fields for highly accurate large displacement optical flow estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 4015–4023, 2015.
[2] T. Brox and J. Malik. Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(3):500–513, 2011.
[3] D. Ceylan, N. J. Mitra, Y. Zheng, and M. Pauly. Coupled structure-from-motion and 3d symmetry detection for urban facades. ACM Trans. Graph., 33(1):2:1–2:15, 2014.
[4] Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790–799, 1995.
[5] Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski. Manhattan-world stereo. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1422–1429. IEEE, 2009.
[6] Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski. Towards internet-scale multi-view stereo. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1434–1441. IEEE, 2010.
[7] Y. Furukawa and J. Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8):1362–1376, 2010.
[8] S. Hare, A. Saffari, and P. H. Torr. Efficient online structured output learning for keypoint-based object tracking. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1894–1901. IEEE, 2012.