Stratified Labeling for Surface Consistent Parallax Correction and Occlusion Completion

Jie Chen; Lap-Pui Chau; Junhui Hou

arXiv:1903.02688·cs.CV·March 26, 2019


TL;DR

This paper introduces a stratified synthesis approach combined with a GAN model to improve light field view synthesis, especially for large perspective shifts, achieving significant quality enhancements over existing methods.

Contribution

The paper presents a novel stratified disparity layer-based synthesis strategy integrated with a GAN for enhanced parallax correction and occlusion completion in light field view synthesis.

Findings

01

Achieves over 3dB quality improvement compared to state-of-the-art methods.

02

Provides more reliable novel view synthesis at large baseline extensions.

03

Effectively preserves scene structures over large perspective shifts.


Tables

Table 1: Quantitative evaluation and comparison for target SAIs synthesized by different algorithms at different baseline extension ratios (PSNR in dB).

Testing Data	Expan. Ratio	LVS-TOG16 [14] PSNR	LVS-TOG16 [14] SSIM	SAA-ECCV18 [38] PSNR	SAA-ECCV18 [38] SSIM	POBR-TIP18 [4] PSNR	POBR-TIP18 [4] SSIM	SL (w/o SC) PSNR	SL (w/o SC) SSIM	SLSC PSNR	SLSC SSIM
Art Gallery2 zoom	5x	19.02	0.82	23.17	0.85	22.60	0.89	24.42	0.91	28.35	0.92
Art Gallery2 zoom	7x	18.25	0.80	19.76	0.81	20.70	0.87	22.92	0.89	25.38	0.90
Art Gallery2 zoom	9x	17.13	0.76	17.93	0.78	19.36	0.85	20.558	0.87	23.28	0.87
Bikes	5x	18.90	0.69	23.75	0.59	22.94	0.80	25.37	0.86	26.87	0.88
Bikes	7x	18.49	0.64	21.52	0.66	21.23	0.74	24.23	0.83	25.48	0.83
Bikes	9x	17.98	0.59	19.93	0.73	20.12	0.69	22.54	0.78	23.85	0.78
Furniture1	5x	26.72	0.96	34.56	0.96	29.06	0.97	30.12	0.97	31.11	0.98
Furniture1	7x	26.45	0.95	32.05	0.95	28.12	0.96	29.50	0.97	30.29	0.97
Furniture1	9x	24.88	0.92	28.91	0.93	27.27	0.95	28.44	0.96	29.30	0.96
Workshop	5x	19.89	0.78	26.36	0.82	24.32	0.91	26.76	0.94	30.30	0.95
Workshop	7x	19.26	0.72	23.29	0.75	22.56	0.86	25.82	0.93	28.03	0.92
Workshop	9x	18.00	0.65	20.66	0.64	21.32	0.82	23.77	0.89	26.46	0.89

Equations

$$I^{t}=\mathcal{R}[I_{-M},\cdots,I_{0},\cdots,I_{M},t],\quad|t|\gg M.$$

I_{v}^{t}(x,y) = [right-hand side missing in the extracted source]

$$\mathcal{S}_{v}^{l}(x,y)=\begin{cases}D_{v}(x,y),&(x,y)\in\Omega_{v}^{l}\\0,&\text{otherwise}.\end{cases}$$

$$I_{v}^{t,l}=\begin{cases}\mathcal{W}[I_{v}(x,y),(t-v)\cdot\mathcal{S}_{v}^{l}],&(x,y)\in\Omega_{v}^{l}\\0,&\text{otherwise}.\end{cases}$$

$$\tilde{I}_{v}^{t}=\sum_{l}[I_{v}^{t,l}(x,y)\cdot R(x,y,l)],\quad l=1,2,\cdots,L.$$

$$\tilde{I}_{v}^{t}=\mathcal{F}_{L}[I_{v},D_{v},t-v].$$

$$V_{d}^{t}=\sum_{v}\tilde{I}_{v}^{t}/(2\cdot M+1).$$

$$W^{t}(x,y)=\mathcal{F}_{L}[D_{0}(x,y),D_{0}(x,y),t].$$

$$V_{sp}^{s,t}=\mathcal{F}_{L}[I_{0},P_{0}^{s},t].$$

\mathcal{T}_{1}(x,y) = [right-hand side missing in the extracted source]

\mathcal{T}_{2}(x,y) = [right-hand side missing in the extracted source]

\mathcal{T}_{3}(x,y) = [right-hand side missing in the extracted source]

$$\mathcal{L}=\mathcal{L}_{L_{1}}+\lambda\mathcal{L}_{GAN}.$$


Taxonomy

Topics: Advanced Vision and Imaging · Computer Graphics and Visualization Techniques · Image Enhancement Techniques

Full text

Stratified Labeling for Surface Consistent Parallax Correction and Occlusion Completion

Jie Chen¹, Lap-Pui Chau¹, and Junhui Hou²

¹ ST Engineering - NTU Corporate Lab, Nanyang Technological University, Singapore

² Department of Computer Science, City University of Hong Kong

Abstract

The light field faithfully records the spatial and angular configurations of the scene, which facilitates a wide range of imaging possibilities. In this work, we propose an LF synthesis algorithm which renders high quality novel LF views far outside the range of angular baselines of the given references. A stratified synthesis strategy is adopted which parses the scene content based on stratified disparity layers and across a varying range of spatial granularities. Such a stratified methodology proves to help preserve scene structures over large perspective shifts, and it provides informative clues for inferring the textures of occluded regions. A Generative-Adversarial network model is further adopted for parallax correction and occlusion completion conditioned on the stratified synthesis features. Experiments show that our proposed model can provide more reliable novel view synthesis quality at large baseline extension ratios. Over 3dB quality improvement has been achieved against state-of-the-art LF view synthesis algorithms.

1 Introduction

With the commercialization of light field cameras such as Lytro [21] and Raytrix [24], light field imaging has become a popular topic and is attracting extensive research and industrial attention. The light field (LF) is a vector function that describes the amount of light propagating in every direction through every point in free space [19]. Compared with conventional 2D cameras, LF cameras capture extra directional information for each light ray, and such information enables exciting applications such as refocusing, 3D scene reconstruction [16, 23], material recognition [33], reflection/specularity removal [22, 28], and virtual/augmented reality display [9], to name just a few.

One of the promising applications for LF imaging is in real-footage acquisition and immersive display of 3D environments. Compared with a conventional stereo system, which relies on binocular disparity for the brain to perceive depth, LF allows correct scene perspectives to be displayed from any viewing angle, with realistic presentation of details such as shading, specularity, and focus shift. With the latest development of LF display technology [27, 10], which aims at glasses-free and fatigue-free immersive 3D perception, LF is now considered a promising future medium for 3D telepresence and virtual/augmented/mixed reality applications. One of the greatest technical challenges for these applications is the extremely large size of LF data, which requires great effort in data acquisition and a large bandwidth for transmission. This has given rise to many research efforts focused on compact representation and compression of LF data [8, 3].

Another way of attacking this limitation is to synthesize novel LF views via a post-processing algorithm given a small amount of LF information. Existing works interpolate densely sampled LF views from a sparse set of inputs [14, 34, 36]. In-between views can be smoothly synthesized, which greatly reduces the amount of data required for transmission in real-time scenarios.

Besides view interpolation, view extrapolation, or baseline extension, presents another promising direction. The need for an expanded LF/camera baseline is multi-fold. First, it gives a wider viewing angle, which provides more visual information to the viewer. Second, a larger baseline facilitates subsequent computational procedures: higher sub-pixel precision can be achieved for disparity estimation, and a more dramatic refocusing effect can be obtained. Most importantly, it gives more free-viewing angles for immersive display applications without increasing data transmission bandwidth.

In this work, we address the problem of synthesizing novel LF views that are far outside the current LF baseline. We propose a stratified synthesis strategy which parses the scene content based on stratified disparity layers and across a varying range of spatial granularities. Such a stratified methodology proves to be helpful in preserving scene content structures over large perspective shifts, and it provides informative clues for inferring the textures of occluded regions.

2 Related Work

Light Field View Interpolation. Closely related to the problem of LF baseline extension, LF view interpolation aims at synthesizing the in-between LF views based on a sparse set of reference input views. Methods for LF view interpolation can be divided into two categories. The first category requires explicit estimation of the scene disparity to guide the view synthesis [30, 4, 14]. Notably, Kalantari et al. [14] proposed to use the disparity estimated by a convolutional neural network (CNN) as a guide to warp all reference views to the target angle; these warped views are subsequently fed into a second CNN for color refinement until a final prediction is reached. Bicubic interpolation was adopted during warping, which links the gradients of errors from the synthesized views with those from the disparity estimation and enables end-to-end training. Methods under this category are limited either by the precision of disparity estimation, or by the warping operators, which only sample over a limited local area and fail to back-propagate synthesis errors over larger ranges.

Methods in the second category directly synthesize the target view via exploration of the geometrical features embedded within the EPIs [36], across stacked sub-aperture views along different directions [26], or via alternating filtering between the spatial-angular domains [38]. Although methods under this category generally produce more realistic renderings, their performance is still limited for scenarios with large camera baselines.

Parallax Magnification via Layered Depth Images. Scene content at different depths shows complex occlusion relationships when viewed from distant viewing angles. It is a well-adopted strategy to segment the pixels into separate layers based on their motion [20] and disparity range [7, 39] for independent processing. Notably, Zhou et al. [39] proposed an end-to-end training framework that first learns the Multi-Plane Images (MPI) with increasing depth ranges. The inferred MPIs are then used to synthesize a range of novel views via a subsequent module. Such a framework preserves the scene geometry in a computationally efficient way; however, it produces noticeable distortions when the parallax shift is large, and it cannot deal with large ambiguous regions caused by occlusion.

Context-Aware Image Inpainting. The challenge of predicting ambiguous image content caused by occlusion is directly related to the problem of image synthesis and inpainting. The Generative-Adversarial Network (GAN) [6] shows strong performance for these applications. Semantic labels have been used to provide content-consistent outcomes for image synthesis [12, 32]. Global and local context features have been adopted to provide more stable and consistent inpainting [11, 37]. These works provide valuable inspiration for our work, in which we aim at inpainting the occluded regions with surface consistent constraints.

In this work, we aim at the task of synthesizing high quality novel LF views far outside the range of angular baselines of the given reference LF. Different from LF view interpolation or near field view extrapolation, this is a more challenging problem in the following sense:

1. The variations between the target view and the given references are much larger. A typical LF view interpolation problem involves pixel translation/scaling in the order of several pixels; however, baseline extension involves translation/scaling in the order of more than 10 pixels. Even small inaccuracy in the disparity estimation could lead to obvious distortions.

2. The occlusion relationships among the scene content need to be more accurately and robustly modeled. Objects at different distances from the camera will have dramatically different occlusion relationships when the viewing angle is significantly changed.

3. The handling of occluded areas is much more challenging, as significant changes in viewing angles could expose large areas unseen from the reference views. How to infer the content of these occluded areas based on the spatial/angular context is an important issue to be addressed.

3 Proposed Method

The LF describes the intensity and direction of every light ray that propagates within an imaging system, and it is usually parameterized as a 4D function $\mathcal{L}(x,y,s,t)$ , with $(x,y)$ representing the spatial domain and $(s,t)$ representing the angular domain [21]. In this work, we limit the LF angular domain $(s,t)$ to the horizontal direction to reduce computational complexity. However, extension to other angular directions or higher angular dimensionality is straightforward.

Suppose we have a densely sampled LF $\mathcal{L}(x,y,s)$ with a limited angular baseline, which is spanned by a horizontal array of $(2\cdot M+1)$ Sub-Aperture Images (SAI) [21]. These SAIs look at the scene from slightly different, equidistant viewing angles [3]. We denote these SAIs as $\{I_{v}(x,y)|v=(-M,-M+1,\cdot\cdot\cdot,0,\cdot\cdot\cdot,M-1,M)\}$ , with $v$ indicating the SAI location with respect to the central SAI $I_{0}(x,y)$ .

Given such a limited angular baseline of $[-M,M]$ , the purpose of our framework is to expand the LF baseline, and synthesize novel SAIs far outside of the current angular limit:

$$I^{t}=\mathcal{R}[I_{-M},\cdots,I_{0},\cdots,I_{M},t],\quad|t|\gg M.$$

Here we use $\mathcal{R}$ to represent the mapping between the input LF reference SAIs $\{I_{-M},\cdot\cdot\cdot,I_{M}\}$ and the target novel SAI $I^{t}$ . To deal with the challenges discussed in Sec. 2, we propose a novel SAI synthesis strategy that significantly extends camera baselines based on a Generative-Adversarial Network (GAN). The framework benefits from combining informative geometrical clues via transformations based on stratified disparity layers and spatial granularities.

The system diagram of our proposed framework, named Stratified Labeling for Surface Consistent parallax correction and occlusion completion (SLSC), is shown in Fig. 3. The stratification is implemented at two levels. First, the scene is divided into different disparity layers. For a target SAI $I^{t}$ , reference SAIs are warped to the desired angle in a layer-wise manner. These warped layers are then fused, which efficiently preserves the occlusion relationships. Second, we segment the reference SAIs into different spatial granularities in the form of Superpixels (SP). Pixels within the same SP are transformed identically, which provides undistorted content reference within the granularity level. These informative clues from stratified transform operations are combined and fed into a generative network, which corrects parallax distortions and completes occluded regions. The details of the proposed SLSC framework are discussed in the following.

3.1 Stratified Disparity Rendering for Structure Projection

3.1.1 Projection over Stratified Disparity Layers

With recent advancement of LF disparity estimation algorithms [35, 13, 31, 4], given a densely sampled LF, an accurate disparity map with sub-pixel precision can be efficiently calculated. We adopt the algorithm by Chen et al. [4] as our disparity estimator, since it shows the advantage in preserving the contours of complex occluding structures, which are most vulnerable to distortions over large perspective transforms. Additionally, the method also works on a superpixel granularity, which fits well into our granulated spatial operations.

We define the operator $\mathcal{W}[I,D]$ which warps the image $I$ according to the disparity map $D$ . Bicubic interpolation [15] will be used for sub-pixel locations. Suppose we have a disparity estimation $D_{v}(x,y)$ for each reference LF SAI $I_{v},\ v\in[-M,M]$ using the method from [4]. Along with $D_{v}(x,y)$ , we also get an estimation confidence map $C_{d}(x,y)$ . The target view $I^{t}(x,y)$ can be calculated based on the reference SAI $I_{v}(x,y)$ according to:

[TABLE]

The subscript $v$ in $I_{v}^{t}(x,y)$ indicates the warping outcome is based on the input reference view $I_{v}$ .

In order to preserve the scene geometry and its occlusion relationships over a significant angular transform, we implement a layer-wise warping and fusion scheme based on stratified disparity maps. The idea of layer-wise warping has been implemented in earlier works [25, 39]. We further extend this strategy into a multi-granularity framework, which efficiently provides complementary information over distortion-prone areas.

We divide the disparity variation range of $D_{v}(x,y)$ into $L$ equal intervals. Pixels with disparity values that fall within the same interval are grouped as one layer. We denote these $L$ layer groups as $\{\Omega_{v}^{l}|l=1,2,\cdot\cdot\cdot,L\}$ , with the superscript $l$ denoting the layer index. Consequently, a layered volume of disparity maps $\{\mathcal{S}_{v}^{l}|l=1,2,\cdot\cdot\cdot,L\}$ can be created according to:

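$$\mathcal{S}_{v}^{l}(x,y)=\begin{cases}D_{v}(x,y),&(x,y)\in\Omega_{v}^{l}\\0,&\text{otherwise}.\end{cases}$$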

A visualization of $\mathcal{S}_{v}^{l}(x,y)$ can be seen in Fig. 1.
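The layer construction described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function name `stratify_disparity` is hypothetical, images are nested lists, and layers are indexed in order of increasing disparity value, which is an assumption (whether the largest index corresponds to the furthest surface depends on the disparity sign convention).

```python
def stratify_disparity(disparity, num_layers):
    """Split a 2D disparity map into num_layers equal-width disparity intervals.

    Returns (labels, layers): labels[y][x] is the 1-based layer index, and
    layers[l - 1][y][x] holds the disparity inside layer l and 0 elsewhere.
    """
    flat = [d for row in disparity for d in row]
    d_min, d_max = min(flat), max(flat)
    # Equal-width intervals; fall back to 1.0 when the map is constant.
    width = (d_max - d_min) / num_layers or 1.0
    labels = []
    for row in disparity:
        label_row = []
        for d in row:
            idx = int((d - d_min) / width)
            # The maximum disparity falls on the upper edge; clamp it into the last layer.
            label_row.append(min(idx, num_layers - 1) + 1)
        labels.append(label_row)
    layers = [
        [[d if lab == l else 0.0 for d, lab in zip(drow, lrow)]
         for drow, lrow in zip(disparity, labels)]
        for l in range(1, num_layers + 1)
    ]
    return labels, layers


labels, layers = stratify_disparity([[0.0, 1.0], [2.0, 4.0]], num_layers=2)
print(labels)     # layer index per pixel
print(layers[1])  # disparity values kept in the second layer
```

Each entry of `layers` plays the role of one $\mathcal{S}_{v}^{l}$ : it is zero outside the pixel group $\Omega_{v}^{l}$ and carries the original disparity inside it.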

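With similar disparity values, it can be assumed that pixels of the same layer $\mathcal{S}_{v}^{l}$ will not alter their occlusion relationships after an angular transform. Warping to the target SAI $I^{t}$ is carried out for each layer independently according to:

$$I_{v}^{t,l}=\begin{cases}\mathcal{W}[I_{v}(x,y),(t-v)\cdot\mathcal{S}_{v}^{l}],&(x,y)\in\Omega_{v}^{l}\\0,&\text{otherwise}.\end{cases}$$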

Consequently, a stratified volume of target SAIs $\{I_{v}^{t,l}|l=1,2,\cdot\cdot\cdot,L\}$ can be calculated.

3.1.2 Layer Volume Fusion

Given the layer-wise warped volume $\{I_{v}^{t,l}|l=1,2,\cdot\cdot\cdot,L\}$ , we combine them to synthesize a complete target SAI.

Starting from the disparity layer ( $l=L$ ) that corresponds to the furthest scene distance range, pixels from $\{I_{v}^{t,L}\}$ will occupy the corresponding pixels in $\tilde{I}_{v}^{t}$ . Subsequently, nearer layers will be processed accordingly. When there is a pixel occupation conflict, nearer layers will replace the pixels from further layers. We use the tensor $R(x,y,l)$ to represent such fusion logic. For each pixel location $(x^{\prime},y^{\prime})$ , only one entry of $R(x^{\prime},y^{\prime},l)$ along $l=1,2,\cdot\cdot\cdot,L$ is set to 1, with all the rest set to zero. The non-zero entry corresponds to the layer in $\{I_{v}^{t,l}|l=1,2,\cdot\cdot\cdot,L\}$ that is closest to the camera and has non-zero pixel content. The final image can be calculated as:

$$\tilde{I}_{v}^{t}=\sum_{l}[I_{v}^{t,l}(x,y)\cdot R(x,y,l)],\quad l=1,2,\cdots,L.$$
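The far-to-near fusion logic can be illustrated with a small sketch. It assumes the layers have already been warped to the target view, that `warped_layers[0]` is the nearest layer ( $l=1$ ) and `warped_layers[-1]` the furthest ( $l=L$ ), and that a zero value marks an empty pixel. The function name `fuse_layers` and the single-channel nested-list image format are hypothetical choices made for illustration only.

```python
def fuse_layers(warped_layers):
    """Fuse layer-wise warped images so that nearer layers overwrite further ones.

    warped_layers[0] is the nearest layer (l = 1), warped_layers[-1] the
    furthest (l = L). Zero-valued pixels are treated as empty.
    """
    height = len(warped_layers[0])
    width = len(warped_layers[0][0])
    fused = [[0.0] * width for _ in range(height)]
    # Paint the furthest layer first, then let nearer layers replace it.
    for layer in reversed(warped_layers):
        for y in range(height):
            for x in range(width):
                if layer[y][x] != 0:
                    fused[y][x] = layer[y][x]
    return fused


near = [[0.0, 5.0, 0.0]]
far = [[1.0, 2.0, 0.0]]
print(fuse_layers([near, far]))  # the last pixel stays empty: an occlusion gap
```

Painting far-to-near gives the same result as selecting, per pixel, the single non-empty layer closest to the camera, which is what the indicator tensor $R(x,y,l)$ encodes. Pixels that are empty in every layer remain zero and correspond to the occlusion gaps discussed later.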

Here we define an operator $\mathcal{F}_{L}[I,D,t^{\prime}]$ which incorporates the above mentioned layer-wise warping plus layer fusion for the synthesis of novel SAIs, and we name such an operator Stratified Disparity Rendering (SDR). $\mathcal{F}_{L}[I,D,t^{\prime}]$ synthesizes a novel view based on a reference SAI $I$ and its disparity map $D$ . $t^{\prime}$ indicates the angular shift distance for the target view (in units of angular distance between neighboring reference SAIs). The subscript $L$ indicates the number of stratified layers. Now the SDR process defined in Eqs. (4), (5), and (6) can be combined and re-written as:

$$\tilde{I}_{v}^{t}=\mathcal{F}_{L}[I_{v},D_{v},t-v].$$

We obtain one prediction $\tilde{I}_{v}^{t}$ from every reference SAI $v=-M,\cdot\cdot\cdot,0,\cdot\cdot\cdot,M$ . These predictions are largely similar; however, they contain complementary perspective information, especially around the occlusion borders. They are combined to provide a more informative prediction according to:

$$V_{d}^{t}=\sum_{v}\tilde{I}_{v}^{t}/(2\cdot M+1).$$

Fig. 2(c) shows an example of the rendered target view $V_{d}^{t}$ for the scene Workshop from the MPI Light Field Archive [2], where $t=40$ , with an angular baseline extension ratio of $\alpha$ =10x. Compared with the ground truth target SAI in Fig. 2(b), obvious distortions and occlusion gaps can be observed.

Additionally, we can also calculate the disparity map for the target SAI with SDR, in the same way as RGB images:

$$W^{t}(x,y)=\mathcal{F}_{L}[D_{0}(x,y),D_{0}(x,y),t].$$

We further quantize $W^{t}(x,y)$ into $L$ discrete labels $W_{L}^{t}(x,y)$ . An example of the labeled disparity map $W_{L}^{t}(x,y)$ for the scene Bikes [2] is shown in Fig. 3(b).

3.2 Stratified Spatial Granularities for Distortion Correction

In order to prevent content distortion and preserve structural consistency over significant angular transforms, we propose to utilize SDR across multiple spatial granularity levels based on units of superpixels (SP).

The concept of SP is to group pixels into perceptually meaningful atomic regions [1, 29, 18]. Boundaries of SP usually coincide with those of the scene content. The SPs are very adaptive in shape, and are more likely to segment uniform depth regions compared with rectangular units. We implement SP segmentations on the central view $I_{0}$ at multiple SP scales (different SP scales correspond to different pixel numbers within each SP). Smaller SPs provide finer granularity that better binds to object boundaries, while larger granularity prevents distortion over complex, and textureless scene regions.

Based on the central view’s pixel-wise disparity map $D_{0}(x,y)$ , we calculate its SP-wise disparity map $P_{0}^{s}(x,y)$ as introduced in [4]. The pixels within the same SP share identical disparity value (initially calculated as the median of confident pixels indicated by $C_{d}(x,y)$ , and then regularized over inter-SP border pixel’s discontinuities). The resulting SP-wise disparity map $P_{0}^{s}(x,y)$ (with superscript $s$ indicating the SP size) strictly adheres to the occluding boundaries, and preserves its internal pixel structure even after significant angular transform operations.
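The initial per-superpixel value (the median over confident pixels) can be sketched as follows. This sketch covers only that first step: the inter-SP border regularization from [4] is not reproduced. The function name `superpixel_disparity`, the confidence threshold of 0.5, and the fallback to all pixels when a superpixel has no confident pixel are assumptions made for illustration.

```python
from statistics import median


def superpixel_disparity(disparity, sp_labels, confidence, conf_threshold=0.5):
    """Assign every pixel the median disparity of the confident pixels in its superpixel."""
    groups = {}  # superpixel id -> (all disparities, confident disparities)
    for drow, srow, crow in zip(disparity, sp_labels, confidence):
        for d, s, c in zip(drow, srow, crow):
            all_vals, good_vals = groups.setdefault(s, ([], []))
            all_vals.append(d)
            if c >= conf_threshold:
                good_vals.append(d)
    # Assumed fallback: use every pixel if no pixel in the superpixel is confident.
    sp_value = {s: median(good or everything)
                for s, (everything, good) in groups.items()}
    return [[sp_value[s] for s in srow] for srow in sp_labels]


disparity = [[1.0, 2.0, 9.0], [3.0, 8.0, 7.0]]
sp_labels = [[0, 0, 1], [0, 1, 1]]
confidence = [[1.0, 1.0, 1.0], [0.0, 1.0, 1.0]]
print(superpixel_disparity(disparity, sp_labels, confidence))
```

Because all pixels in a superpixel share one disparity value, a subsequent warp moves the whole superpixel rigidly, which is the property the text relies on to keep internal pixel structure intact.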

With the SDR operator $\mathcal{F}_{L}$ , we can synthesize the target novel SAI based on the SP-wise disparity map $P_{0}^{s}(x,y)$ :

$$V_{sp}^{s,t}=\mathcal{F}_{L}[I_{0},P_{0}^{s},t].$$

The central view can be warped to the target view based on different SP size levels, which provides predictions with pixel details strictly preserved at different scales. In our implementation, we set up 2 SP size levels with 100 and 400 pixels per SP, respectively. This gives synthesis outcome of $V_{sp}^{1,t}$ , and $V_{sp}^{2,t}$ . As shown in Figs. 2(d) and (e), compared with $V_{d}^{t}$ , there are more occlusion gaps in $V_{sp}^{s,t}$ . However, the SP helps correct regions with obvious wrong disparities over the occlusion borders (as explained in [4]), and it keeps textureless surfaces geometrically consistent after warping.

3.3 Surface Consistent Features

As can be seen in Figs. 2(c), (d), (e) and 3(b), there exist large ambiguous areas (zero valued pixels) in the SDR outputs $V_{d}^{t}$ , $V_{sp}^{1,t}$ , $V_{sp}^{2,t}$ and $W_{L}^{t}$ . We aim to inpaint these ambiguous areas with realistic predictions.

Since the discretized disparity label map $W_{L}^{t}$ has a limited value range and a much simpler distribution than the RGB images ( $V_{d}^{t}$ , $V_{sp}^{1,t}$ , and $V_{sp}^{2,t}$ ), it is much easier to inpaint $W_{L}^{t}$ . We adopt a classic morphological dilation operator which propagates larger labels (which correspond to more distant surfaces) within a local window to the zero marked ambiguous pixels. The dilation operator is based on the simple yet efficient logic that any region of ambiguity belongs to its furthest-away local surface. An inpainted label map $\tilde{W}_{L}^{t}$ is shown in Fig. 3(c) for the data Bikes. As can be seen, the missing areas are completed with well-defined occluding contours.
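The dilation-based label completion can be sketched as below. This is a minimal illustration under stated assumptions rather than the authors' code: the function name `inpaint_labels`, the square window of radius 1, and the repetition of the dilation until no zero pixel can be filled are all assumptions, since the text only specifies that larger labels are propagated to zero-marked pixels within a local window.

```python
def inpaint_labels(labels, radius=1, max_iters=100):
    """Fill zero-valued (ambiguous) pixels with the largest label in a local window.

    Only zero pixels are modified; known labels are left untouched. The pass is
    repeated so that gaps wider than the window are filled progressively.
    """
    height, width = len(labels), len(labels[0])
    current = [row[:] for row in labels]
    for _ in range(max_iters):
        changed = False
        updated = [row[:] for row in current]
        for y in range(height):
            for x in range(width):
                if current[y][x] != 0:
                    continue
                window = [current[j][i]
                          for j in range(max(0, y - radius), min(height, y + radius + 1))
                          for i in range(max(0, x - radius), min(width, x + radius + 1))]
                best = max(window)  # the largest label is the furthest surface
                if best != 0:
                    updated[y][x] = best
                    changed = True
        current = updated
        if not changed:
            break
    return current


print(inpaint_labels([[2, 0, 3]]))     # the gap takes the further label, 3
print(inpaint_labels([[1, 0, 0, 3]]))  # each gap pixel takes its local maximum
```

Taking the window maximum encodes the stated rule that an ambiguous region belongs to its furthest-away local surface: when a gap sits between a near and a far surface, the far label wins.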

With $\tilde{W}_{L}^{t}$ as a surface consistent guide, the inpainting of RGB images can potentially be informed of which surface each ambiguous region belongs to, such that relevant structural and textural features can be learned and transferred directly from regions with identical labels, avoiding confusion with the occluding surfaces.

3.4 Generative-Adversarial Network for Parallax Correction and Occlusion Completion

3.4.1 The Parallax Correction and Occlusion Completion Network

We propose a Generative-Adversarial Network to simultaneously correct parallax distortions caused by inaccurate disparity estimations, and inpaint the ambiguous regions caused by occlusion. As shown in Fig. 1, the Parallax Correction and Occlusion Completion (PCOC) network follows an encoder-decoder structure, which reduces memory usage and computational time by initially decreasing the resolution before further processing the image. The output is restored to the original resolution using deconvolution layers.

The initial warping outcomes by the SDR from different granularity levels $\{V_{d}^{t},V_{sp}^{1,t},V_{sp}^{2,t}\}$ , along with the inpainted disparity label map $\tilde{W}^{t}_{L}$ are used as input to the PCOC generator network in the following format:

[TABLE]

The difference between each superpixel-level rendering ( $V_{sp}^{1,t}$ and $V_{sp}^{2,t}$ ) and $V_{d}^{t}$ is taken, such that the input features $\mathcal{T}_{1}$ and $\mathcal{T}_{2}$ combine information from all granularity levels. With a variation range around zero, $\mathcal{T}_{1}$ and $\mathcal{T}_{2}$ highlight the differences between the warping scales, which facilitates more efficient learning. $\mathcal{T}_{3}$ normalizes the label map $\tilde{W}^{t}_{L}$ to the range [-0.5,+0.5]. This feature is expected to guide the inpainting process of the RGB images to be surface consistent with the occluded region.

The occlusion gaps are zero valued in $V_{d}^{t}$ , $V_{sp}^{1,t}$ , and $V_{sp}^{2,t}$ . After the subtractions in $\mathcal{T}_{1}$ and $\mathcal{T}_{2}$ , only those occlusion gaps that exist in both granularities are kept, which indicates to the PCOC network over which pixels to carry out occlusion completion and over which areas to impose parallax correction, such that the final RGB output conforms to the statistical distribution of an undistorted natural image.

In our implementation, the spatial patch size is set as $W$ = $H$ =128. The dimensionality of the input feature to the PCOC network is $\{\mathcal{T}_{1},\mathcal{T}_{2},\mathcal{T}_{3}\}\in\mathbb{R}^{W\times H\times 7}$ , and the output is an RGB image $\mathring{I}^{t}\in\mathbb{R}^{W\times H\times 3}$ . Note that in the end, $V_{d}^{t}$ will be added back to $\mathring{I}^{t}$ to restore color.

3.4.2 Context Discriminator

As illustrated in Fig. 1, the context discriminator is based on a CNN which gradually compresses the image into small feature vectors. The output of the network is a continuous value corresponding to the probability of the PCOC output being real. The discriminator works to improve the realism of the generator outputs by learning the distribution of ground truth views without parallax distortions or inpainting errors.

3.4.3 Training Details

We used the MPI Light Field Archive [2] for the training as well as evaluation of our model. The archive consists of 9 synthetic and 5 captured real-world scenes. All 14 LFs have a large angular baseline of 101 views, which allows enough angular flexibility to train our model. The 9 center views (view indices 47 to 55 out of the 101 views) were chosen to simulate the narrow-baseline LF input, and side views with indices 11 to 31 and 71 to 91 were used as ground truth for baseline extension. Two real-world scenes, Bikes and Workshop, and two synthetic scenes, ArtGallery2zoom and Furniture1, were used as the testing dataset. The remaining 10 scenes in the MPI Light Field Archive were used as the training dataset, and they have been augmented according to the methods introduced in [26]. Note that since the input LF baseline radius is $(9-1)/2=4$ and the center view is view 51, views 11 and 91 (40 views from the center) correspond to an angular baseline extension ratio of $\alpha$ =10x, and views 31 and 71 (20 views from the center) correspond to $\alpha$ =5x.
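The baseline extension ratio arithmetic can be checked with a short sketch. It assumes the ratio is the angular distance of the target view from the center view divided by the input baseline radius, which is the reading implied by the radius computation $(9-1)/2=4$ ; the function name `extension_ratio` is hypothetical.

```python
def extension_ratio(view_index, first_input=47, last_input=55):
    """Baseline extension ratio of a target view relative to the input view range."""
    center = (first_input + last_input) / 2  # view 51 for inputs 47..55
    radius = (last_input - first_input) / 2  # (9 - 1) / 2 = 4
    return abs(view_index - center) / radius


for view in (11, 31, 71, 91):
    print(view, extension_ratio(view))
```

Under this reading, the outermost ground truth views (11 and 91) sit at a ratio of 10 and the innermost ones (31 and 71) at a ratio of 5.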

In order to train the network such that inpainting of occlusion areas, and the parallax correction areas are realistic looking, two loss functions are jointly used: an $L_{1}$ loss term $\mathcal{L}_{L_{1}}$ for content fidelity between the rendered novel view $\mathring{I}^{t}$ and the ground truth view, and a Generative Adversarial Network (GAN) [6] loss $\mathcal{L}_{GAN}$ to improve the realism of the results. The final objective is:

$$\mathcal{L}=\mathcal{L}_{L_{1}}+\lambda\mathcal{L}_{GAN}.$$

To optimize our networks, we used stochastic gradient descent (SGD) to minimize the objective functions. We alternated between one gradient descent step on the generative PCOC net, and then one step on the context discriminator net. The mini-batch size is set to 20 for a better trade-off between speed and convergence. The Xavier approach [5] is used for network initialization, and the ADAM solver [17] is adopted for system training, with parameter settings $\beta_{1}=$ 0.9, $\beta_{2}=$ 0.999, and learning rate $\alpha=$ 0.0001. Following suggestions from [11], we train the PCOC net for 1000 epochs, and then start the alternating training with the discriminator.

4 Model Evaluation

We evaluate our SLSC model both quantitatively and qualitatively, and compare its performance with state-of-the-art methods, including the spatial-angular alternating filtering algorithm (SAA-ECCV18) [38], the learning based view synthesis algorithm (LVS-TOG16) [14], and a direct warping algorithm based on disparity maps estimated based on superpixel regularization over partially occluded border regions (POBR-TIP18) [4]. Note that both SAA-ECCV18 and LVS-TOG16 were originally designed for LF view interpolation given a sparsely sampled LF input. Both of them have been retrained over the MPI dataset for the purpose of baseline extension (extrapolation). 4 interleaved layers have been adopted for SAA-ECCV18. 9 adjacent center views (view index range from 47 to 55) were selected as the input reference for both networks, and side views of index range 11 to 31, and 71 to 91 were used as ground truth for training, which represents angular baseline extension limits from $\alpha=$ 5x to $\alpha=$ 10x.

Another baseline method was also chosen for comparisons, in which we used $\{\mathcal{T}_{1},\mathcal{T}_{2}\}$ as the input feature to the PCOC network, without the surface consistent guide $\{\mathcal{T}_{3}\}$ . This baseline was used to assess the importance of the surface consistency feature for the synthesis of novel views. We denote this method as SL (without Surface Consistency feature SC).

4.1 Quantitative Evaluation

We quantitatively evaluate the PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index) of rendered novel views. Table 1 shows the results for the four testing scenes at baseline extension ratios of 5x, 7x, and 9x. Additionally, the average PSNR over all the testing data has been plotted as a curve for each competing method in Fig. 4.
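For reference, PSNR follows the standard definition $10\log_{10}(\mathrm{MAX}^{2}/\mathrm{MSE})$ . The sketch below computes it for single-channel images stored as nested lists; the peak value of 255 and the single-channel format are assumptions, since the text does not state the exact evaluation settings, and the function name `psnr` is illustrative.

```python
import math


def psnr(reference, rendered, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized images."""
    squared_errors = [(a - b) ** 2
                      for ref_row, out_row in zip(reference, rendered)
                      for a, b in zip(ref_row, out_row)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


ground_truth = [[10.0, 20.0], [30.0, 40.0]]
synthesized = [[11.0, 19.0], [31.0, 39.0]]
print(round(psnr(ground_truth, synthesized), 2))
```

Because the scale is logarithmic, a gain of 3 dB corresponds to roughly halving the mean squared error.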

As can be seen, our proposed SLSC model provides a consistent 2-5 dB advantage over all compared methods. The advantage is especially large at larger baseline extension ratios.

4.2 Qualitative Evaluation

We visually compare the quality of rendered novel views from different methods. Fig. 5 shows the output from the PCOC-Net based on the input features. We can see that the ambiguous regions caused by occlusions and perspective shifting have been restored well. In the top-left example, note how the width of the chair legs has been restored to normal compared with the distorted input $V_{d}^{t}$ . Note that to directly assess the PCOC network, we have not applied any post-processing algorithms over the inpainted areas. The PCOC network learns the textures of missing areas, and other simple techniques can be applied to blend the color to be consistent with its surroundings [11].

More visual comparisons are given in Fig. 7. As can be seen, the novel views rendered by our proposed SLSC model have the highest quality, while the competing methods show obvious blur and errors for structures with significant angular transforms.

4.3 Ablation Study

For the ablation study, we focus on the contribution of the surface consistent label map $\tilde{W}_{L}^{t}$ on the quality of synthesized novel SAIs. With $\tilde{W}_{L}^{t}$ as guide, the SLSC system produces more stable and surface consistent inpainting outcomes as highlighted in Fig. 6. The baseline system SL (without SC) tends to produce blurred and incorrect predictions around the occlusion borders.

5 Conclusion

We have proposed an LF view synthesis algorithm which renders high quality novel LF views far outside the angular baselines of reference SAIs. A stratified synthesis strategy is adopted which parses the scene content based on different disparity layers and different spatial granularities. Such a stratified methodology proves to be helpful in preserving scene content structures over large perspective shifts, and it provides informative clues for inferring the textures of occluded regions. A generative-adversarial model has been adopted for parallax correction and occlusion completion. Experiments show that our proposed model can provide more reliable novel view synthesis quality, especially at large baseline extension ratios.

Bibliography

[1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk. Slic superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282, 2012.
[2] V. K. Adhikarla, M. Vinkler, D. Sumin, R. K. Mantiuk, K. Myszkowski, H. Seidel, and P. Didyk. Towards a quality metric for dense light fields. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3720–3729, July 2017.
[3] J. Chen, J. Hou, and L. P. Chau. Light field compression with disparity-guided sparse coding based on structural key views. IEEE Transactions on Image Processing, 27(1):314–324, Jan 2018.
[4] J. Chen, J. Hou, Y. Ni, and L. Chau. Accurate light field depth estimation with superpixel regularization over partially occluded regions. IEEE Transactions on Image Processing, 27(10):4889–4900, Oct 2018.
[5] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[7] L.-w. He, J. Shade, S. Gortler, and R. Szeliski. Layered depth images. 1998.
[8] J. Hou, J. Chen, and L.-P. Chau. Light field image compression based on bi-level view compensation with rate-distortion optimization. IEEE Transactions on Circuits and Systems for Video Technology, 29(2):517–530, 2019.