Learning to Predict Image-based Rendering Artifacts with Respect to a Hidden Reference Image

Mojtaba Bemana; Joachim Keinert; Karol Myszkowski; Michel Bätz; Matthias Ziegler; Hans-Peter Seidel; Tobias Ritschel

arXiv:1812.02552·cs.GR·August 22, 2019


TL;DR

This paper introduces a neural network that predicts image quality differences without needing a reference image, improving robustness and enabling applications like faster light field capture and depth adjustment.

Contribution

We propose a novel neural network architecture and training method that accurately predicts image differences without reference images, outperforming traditional metrics in certain scenarios.

Findings

1. The no-reference metric can outperform reference-based metrics subjectively.
2. The approach reduces light field capture time.
3. It provides guidance for interactive depth adjustment.


Table 1: Error of the metric predictions on the test data for different variants of our algorithm (column groups) and different partitions (All/Clean/Distorted) of the test data (columns) on different metrics (rows). Winners per partition are marked bold.

Variant	Full			NoNatural			NoBalance
Metric	All	Clean	Distorted	All	Clean	Distorted	All	Clean	Distorted
MSE	.098	.006	.189	.137	.092	.182	.102	.003	.201
SSIM	.078	.013	.143	.143	.159	.127	.080	.012	.149
VGG	.085	.006	.165	.207	.293	.121	.092	.008	.176



Mojtaba Bemana¹, Joachim Keinert², Karol Myszkowski¹, Michel Bätz², Matthias Ziegler², Hans-Peter Seidel¹, Tobias Ritschel³

¹MPI Informatik, ²Fraunhofer IIS, ³University College London

Abstract

Image metrics predict the perceived per-pixel difference between a reference image and its degraded (e. g., re-rendered) version. In several important applications, the reference image is not available and image metrics cannot be applied. We devise a neural network architecture and training procedure that allows predicting the MSE, SSIM or VGG16 image difference from the distorted image alone while the reference is not observed. This is enabled by two insights: The first is to inject sufficiently many undistorted natural image patches, which can be found in arbitrary amounts and are known to have no perceivable difference to themselves. This avoids false positives. The second is to balance the learning, where it is carefully ensured that all image errors are equally likely, avoiding false negatives. Surprisingly, we observe that the resulting no-reference metric can, subjectively, even perform better than the reference-based one, as it had to become robust against misalignments. We evaluate the effectiveness of our approach in an image-based rendering context, both quantitatively and qualitatively. Finally, we demonstrate two applications which reduce light field capture time and provide guidance for interactive depth adjustment.

1 Introduction

Computer vision or graphics experts easily recognize image artifacts that might be highly domain-specific. An image-based rendering (IBR) specialist will quickly notice where depth estimation failed, where transparency was not handled or where a highlight did not move correctly. Similarly, in computer graphics, artifacts resulting from Monte Carlo noise in image synthesis when producing a feature film, or shadow bias [59] in a computer game are easily spotted by domain experts. The assessment typically is not limited to detection, but importantly includes judging magnitude as well as spatial locality.

The importance of interacting with errors can be seen from photographs with spatially annotated over- and under-exposure artifacts, as done for instance by Henri Cartier-Bresson [11]. Remarkably, all this is not achieved by comparing an image to a reference, but by experience and intuition built from knowing what natural images look like and how images with artifacts differ. Can we enable a machine to also perform such a task?

More formally, we face the challenge illustrated in Fig. 1. Given an image $\mathcal{A}$ that is a distorted version of a reference $\mathcal{B}$ we wish to predict their difference $\mathcal{A}\ominus\mathcal{B}$ without access to $\mathcal{B}$ . The lower right image shows the ground truth metric response $\mathcal{A}\ominus\mathcal{B}$ . This metric could simply be the mean square error (MSE as used in Fig. 1), a more perceptual metric like SSIM [57] or even VGG-16 activation differences that are effective as an image metric [48, 62]. More particularly, we go beyond the typical mean opinion scores [50] given to uniform distortions such as noise or JPEG compression, and seek to produce localized distortion visibility maps without accessing the reference.

In this paper, we choose to study one specific form of artifacts that arise in image-based rendering (IBR) [37, 16], in particular, when employed for novel-view synthesis from sparse light fields (LFs) [29]. Novel-view synthesis is important in virtual reality and movie production, where LFs are used to provide head motion parallax and special effects. Moreover, having a localized error prediction is also important for quality control. In IBR, artifacts are very localized (e. g., around certain depth edges), and creating opinion scores or even a spatio-angularly annotated dataset of LF artifacts in a size sufficient for machine learning appears to be a daunting task. Our method proceeds without all of this.

Addressing this challenge, we make use of convolutional neural networks. We will show how learning this mapping directly results in many false positives or false negatives. Instead, two important ingredients come together in our approach. First, as the number of images containing artifacts is typically limited, we need to augment the training data with natural images that are free from artifacts. Second, we propose a way to find the right balance between natural and distorted training data.

Not requiring a reference is useful whenever the original is inaccessible (lost, impossible to compute, unavailable, undefined). Furthermore, we demonstrate one application of a no-reference metric in light field capturing. We first capture a sparse light field, followed by an interpolation of the intermediate views. If our metric indicates that those intermediate views have errors, these views are recaptured. This allows acquiring a higher-quality light field in much shorter time compared to dense LF capturing.

2 Previous Work

In this section, we discuss objective image quality metrics, with special emphasis on those that do not require the undistorted reference image. Then, we briefly characterize IBR-specific artifacts, as well as metrics specialized in their detection, which is the key focus of this work.

Image metrics

Some applications require quality metrics, while others need visibility metrics [10].

Image quality metrics (IQMs) evaluate the distortion magnitude and are typically trained on mean-opinion score (MOS) data [47, 42] that labels the entire image with a single quality score. The most commonly used IQMs such as PSNR, SSIM, MS-SSIM [56], FSIM [61], and CIELAB [63] are full-reference (FR) metrics that take as input the reference and distorted images, and compute local differences that are pooled into a global, single quality score. Recently, it has been demonstrated that CNN-based FR-IQMs achieve the best performance in predicting MOS data [2, 7]. Zhang et al. [62] employed crowdsourcing and created a large-scale patch-based dataset in two perceptual experiments: (1) two-alternative forced choice (2AFC) on distortion strength, and (2) “same/not same” near-threshold distortion visibility. They train different network architectures and report in each case a much better performance than traditional FR-IQMs in predicting their data from both experiments.

Visibility metrics (VMs) predict the distortion perceptibility for every pixel in the form of visibility maps. VMs are specifically tuned for detecting near-threshold distortions, which is required in many graphics and vision applications that cannot tolerate any perceivable quality reduction and require local information on the distortion positions. To decide on the visibility of such near-threshold distortions, models of human vision are often employed, where the most prominent FR-VM examples include VDM [34], VDP [14], and HDR-VDP-2 [35]. In the specific task of predicting selected rendering and compression artifacts, the best performance has been achieved using machine learning [8] and CNN-based techniques [60, 40].

No-reference metrics

In this work, we focus on VMs due to the locality of their prediction, but we are specifically interested in the more challenging no-reference setup, where the reference image is not available. We discuss the most successful and recent NR-IQMs that rely on machine learning techniques, and we also refer the interested reader to more comprehensive metric surveys in [10, 28]. Early machine learning techniques employed predefined features such as SIFT and HOG [39, 38, 44, 51], and measured their distortions with respect to natural image statistics [56]. Recently, CNN architectures have been applied to such feature learning as well as the MOS regression at the same time [5, 24, 7, 50]. To compensate for a low number of MOS-labeled images, such solutions typically rely on patches, where they assign the same MOS score to all patches that belong to a given image [28]. Such practice is justified for specific classes of distortions that affect the whole image uniformly, which might be the case for certain types of image noise or compression artifacts, but might confuse the network in case of localized distortions such as those occurring in IBR.

To compensate for the lack of true local reference images, Bosse et al. [7] learn the importance of local patches, but their key motivation is not in deriving the localized VM, but rather in estimating relative patch weights in the aggregated MOS rating. Lin and Wang [31] employ a quality-aware generative network to hallucinate the reference image, which by employing adversarial learning is further refined by an IQM-discriminator that is trained on ground truth references. Their hallucination-guided quality regression network is fed with the difference between the hallucinated and distorted images, as well as the distorted image itself to predict the MOS value. The quality-aware generative network, hallucination-guided quality regression network, and the IQM-discriminator are jointly optimized in an end-to-end manner. Kim and Lee [27] apply state-of-the-art FR-IQMs such as SSIM to generate proxy scores on patches as the ground truth to pre-train the model and then fine-tune their target NR-IQM. At intermediate stages the regression network considers mean values and the standard deviations of per-patch 100-element feature vectors which are then pooled to a per-image quality score.

In this work, we also employ state-of-the-art FR-IQMs to perform an initial per-patch distortion annotation, and strike the required balance between different error magnitudes in the training data, which is essential for meaningful training and shift-invariant properties of our NR-VM.

The research on NR-VMs is extremely sparse, presumably due to limited access to locally labeled images [21, 8, 60]. A notable exception is the work of Herzog et al. [21], who employ a support vector machine (SVM) to predict per-pixel distortions for selected rendering artifacts (they do not consider IBR) and achieve performance comparable to FR-VMs. Here, we demonstrate that time-consuming manual per-pixel distortion labeling is not strictly required.

In cases where training data is easy to produce (such as uniform distortions like noise, JPEG, etc.) and no perceptual calibration is required, supervised training has been employed to detect aliasing artifacts [40]. Our work differs, as we only have very limited training data available, both because only very few ground truth images are available for IBR and because we need perceptual calibration. Learning from little data is part of our balancing contribution.

Vogels and colleagues [54] have proposed a method to denoise path-traced images. To steer the amount of denoising, they also trained a neural network to predict distortion in terms of MC variance, which is as unknown as the pixel value to be MC-estimated itself. Interestingly, in both their work and ours, an NR metric is used to steer adaptation: for them it is a denoising algorithm; for us, one application is controlling capture hardware. Their task is different as they predict SSIM error from a pair of images, where one is noisy and the other is denoised. This restricts the distortions to the difference between denoised and reference, which are smaller than IBR artifacts and also do not need to be perceptually calibrated. The fact that images with MC noise can be generated in arbitrary amounts also underlines what is the focus of our work: coping with limited training data.

Image-based rendering

IBR for structured or unstructured light fields (LFs) of real-world scenes involves a number of computational steps such as depth reconstruction, neighboring view-image warping, warped view-image blending, and disocclusion hole in-painting. Each of these steps is prone to inaccuracies that manifest themselves as IBR-specific artifacts such as object shifting (incorrect depth), crumbling, distorted edges (depth discontinuities, e. g., due to compression), popping (fluctuations in depth), ghosting (depth inaccuracy, view blending), stretching, blurry or black regions (in-painting) [52]. Specialized IBR quality metrics often rely on leaving one view out as the reference [55, 12, 49, 6] or searching for matching image blocks after their registration [3, 17], and then employing customized FR-IQMs. NR-IQMs typically focus on detecting selected distortion types such as blurring and ghosting [4], ghosting and popping [18], or blurring, stretching and black holes [52], and aggregate them into one final scalar score. Perceptual experiments have been performed to understand how observers rate the severity of different artifacts as a function of rendering parameters such as the number of blended views and viewing angles [53]. A skillful pre-processing of depth (e. g., depth blurring in uncertain regions) and choice of particular algorithmic solutions can substantially suppress artifacts [20, 46], eventually using a neural network trained to predict blending weights to combine the warped images [19]. More objectionable distortion types can be traded off with those that are more visually appealing (e. g., blurry depth that is more consistent but further from the ground truth). Instead of focusing on selected distortion types, Ling et al. [32] propose to learn a dictionary based on manually labeled data. The features extracted from an image allow predicting a MOS value using support vector machine regression. As data labeling can be time consuming, Ling et al. [33] create artificial training data that aims to simulate occlusion problems. A Generative Adversarial Network (GAN) discriminator [15], targeted to identify in-painted image regions, is used to predict a quality score.

All the discussed work on IBR quality evaluation essentially focuses on providing a single score per image, which then also serves as a metric for performance evaluation. While some FR-IQMs generate viable per-pixel VMs at intermediate stages [12, 49], their accuracy is not formally evaluated. The same holds for the NR-IQM [32]. Our work hence differs from all previous work by pursuing the NR-VM setup to detect local IBR distortions using CNN-based techniques.

3 Learning a No-reference Metric

Overview

Test-time input to our method is a single distorted RGB image $\mathcal{A}$ . While our distortions are always IBR artifacts resulting from a specific depth reconstruction and specific IBR method, the internals of how this image is generated (e. g., the depth map) are not required; we only need access to the result. Withheld is the reference RGB image $\mathcal{B}$ . In the case of IBR, such a distorted-undistorted pair is typically produced by rendering a known image from other known views.

Output of our proposed method is a single-channel (scalar) image that predicts a given difference metric response $\mathcal{A}\ominus\mathcal{B}$ , where the $\ominus$ operator depends on the choice of the specific metric, e. g., MSE, SSIM [57], or VGG16 [48]. High values are produced where the images are different and small values where they are similar. This output is accurate if it has few false positives and false negatives. False positives correspond to predicting a perceived difference where there are no artifacts, and false negatives correspond to visible artifacts the metric fails to report.

Note that two forms of approximations are made here: the first is the error that the metric itself makes when comparing two images relative to human judgment. The second is the error that our method has, with respect to a prediction. Ultimately, our method is a prediction of a prediction, but surprisingly can perform better than one prediction alone.

3.1 Training data

Our training data comprises existing metric responses $\mathcal{A}\ominus\mathcal{B}$ to the distorted image $\mathcal{A}$ and the clean reference image $\mathcal{B}$ . Strictly speaking, learning does not even observe the reference image $\mathcal{B}$ , but in practice, it is required to compute the metric response $\mathcal{A}\ominus\mathcal{B}$ .

For creating our training dataset, we used captured LF images of 42 different scenes, which come from the Stanford LF repository [1], the Fraunhofer IIS light field dataset [13], Google Research work [41], and Technicolor [45] as well as from our own captured images. All 4D LF datasets comprise conventional 2D images in a resolution up to 2k $\times$ 2k, taken from a range of sparse view points, such as in a $3\times 3$ camera array with known camera positions. For each LF view point, we first estimate the depth using a light field depth estimation technique [13] and then warp [36] the image into all other views. For each LF, we use the four corner views to generate novel-view images at the positions of the remaining views. Each warped view corresponds to one original view, and we compute the response of a full-reference metric to this pair. With approx. 9 views per LF, of which the five non-corner views are re-rendered, and 42 LFs in total, this amounts to only 210 unique images ( $42\times 5$ ), i. e., a comparatively low number for a training task.

We use six scenes for testing and the rest for training. The same split is also applied later for the user study. Our test scenes are totally different from the training scenes, which is important as the number of scenes in the training set is small and generalization across them is an additional challenge.

The natural images used in our training and test dataset are sourced from the Inria Holidays image dataset [23] which have a comparable resolution to our LF images.

Our method is independent of the actual underlying metric $\ominus$ we predict. We will denote this response neutrally as $\mathcal{A}\ominus\mathcal{B}$ . We explored three metrics: MSE, SSIM and VGG16. MSE is defined as the average squared length of the per-pixel RGB difference vector. The SSIM metric uses the original implementation [57]. VGG16 [62] transforms both $\mathcal{A}$ and $\mathcal{B}$ into the VGG16 feature space and picks the activations at layer five, which is 512-dimensional. The $L_{2}$ difference of these two vectors is used as the metric response. For each metric, we normalize the responses such that values up to the 95th percentile across the training dataset fall between 0 and 1.
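As an illustration of the MSE response and the percentile normalization, the following is a minimal sketch rather than the implementation used in the paper. The function names `mse_response`, `percentile_scale` and `normalize` are hypothetical, and it is an assumption that normalization is a plain division by the 95th percentile without clipping.

```python
import numpy as np

def mse_response(a, b):
    """Per-pixel squared length of the RGB difference vector.

    a, b: float arrays of shape (H, W, 3). Returns an (H, W) response map.
    """
    diff = np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64)
    return np.sum(diff ** 2, axis=-1)

def percentile_scale(train_responses, q=95.0):
    """q-th percentile over all pixels of all training response maps."""
    flat = np.concatenate([np.ravel(r) for r in train_responses])
    return float(np.percentile(flat, q))

def normalize(response, scale):
    """Assumption: plain division, so values above the percentile exceed 1."""
    return response / scale
```

Averaging such a map over a patch gives one scalar MSE value per patch; the same scale factor would be reused for the test data.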

3.2 Architecture

We use a simple encoder $P$ [43] that has learnable parameters $\Theta$ and predicts the error map $P(\mathcal{A}|\Theta)$ by observing $\mathcal{A}$ (Fig. 2).

The network comprises 5 layers ( $32\times 32$ patch size) with a total of $|\Theta|=175{,}537$ learnable parameters and is trained on all patches of the training set in a sliding-window fashion.

The loss is the $L_{1}$ error of the predicted metric response, so $\|P(\mathcal{A}|\Theta)-(\mathcal{A}\ominus\mathcal{B})\|_{1}$ . Note that the loss is always $L_{1}$ , while the metric can be the $L_{2}$ -norm-like MSE as well as SSIM or VGG16.
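To make the distinction between loss and metric concrete, here is a small sketch of the $L_{1}$ loss on response maps. The name `l1_loss` is hypothetical, and the target map may come from any of the three metrics; training frameworks often use the mean instead of the sum, which differs only by a constant factor.

```python
import numpy as np

def l1_loss(prediction, target):
    """L1 norm of the difference between predicted and ground-truth response maps."""
    prediction = np.asarray(prediction, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return float(np.sum(np.abs(prediction - target)))
```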

Balancing

We have explained why, and will see from the ablation study, that it is important to have natural patches, but the question is how many. If we take an unlimited number, the metric prediction simply always returns zero, because natural patches have no error to themselves.

Our solution is to start with a half-half mix of distorted and clean patches. Regrettably, many of the distorted patches, which make up 50 % of the total, also have small errors that are close to zero. These patches are exactly those for which IBR was successful, i. e., did not have any artifacts. Depending on the metric, this imbalance can be very strong, and in particular for MSE, it is extremely heavy-tailed (Fig. 3). To address this, we balance the error distribution for the distorted half when creating the training data as follows: First, we sort all patches by their metric response into a priority queue. Then, we uniformly random-sample the range from zero to the 95th percentile of the metric response distribution. For every sample $i$ with value $\xi_{i}$ , we find the patch $j$ with the most similar metric response $d_{j}$ , remove it from the queue and add it to the training dataset. When the minimum difference $|\xi_{i}-d_{j}|$ is larger than a threshold $\epsilon$ , we reject the sample. This is repeated until a target patch count, such as 250 k, is reached.
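The balancing steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `balance_patches`, the use of a sorted list with binary search in place of the priority queue, and the `max_draws` safeguard against endless rejection are all choices made here.

```python
import bisect
import random

def balance_patches(responses, target_count, epsilon, upper, seed=0, max_draws=None):
    """Pick patch indices whose metric responses are roughly uniform on [0, upper].

    responses: one metric response d_j per distorted patch.
    upper: upper end of the sampled range, e.g. the 95th percentile.
    """
    rng = random.Random(seed)
    # Sorted (response, patch index) pairs stand in for the queue.
    pool = sorted((d, j) for j, d in enumerate(responses))
    keys = [d for d, _ in pool]
    if max_draws is None:
        max_draws = 100 * target_count  # assumption: cap on rejected samples
    selected, draws = [], 0
    while len(selected) < target_count and pool and draws < max_draws:
        draws += 1
        xi = rng.uniform(0.0, upper)  # uniform sample xi_i
        pos = bisect.bisect_left(keys, xi)
        # The nearest remaining response sits at pos - 1 or pos.
        candidates = [k for k in (pos - 1, pos) if 0 <= k < len(pool)]
        best = min(candidates, key=lambda k: abs(keys[k] - xi))
        if abs(keys[best] - xi) > epsilon:
            continue  # reject: no remaining patch is close enough
        selected.append(pool[best][1])
        del pool[best]
        del keys[best]
    return selected
```

Because each selected patch is removed, the abundant near-zero patches can only be drawn as often as the uniform samples land near zero, which flattens the heavy-tailed distribution.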

4 Evaluation

4.1 Methods

Training Strategies

We compare three different strategies for training. The first is ours, the other two are ablations. Full is our complete method involving 50 % natural patches and a balancing of the other 50 % as described in Sec. 3.2. NoBalance is realized by a similar 50/50-split, but we train on all distorted patches without the balancing. NoNatural adapts the balancing to take 100 % of the patches coming from IBR without adding the natural patches as described in Sec. 3.2. All training sets, albeit processed differently, have the same size of ca. .5 M patches.

Error

As we predict metric responses, our error is the same as the loss, the absolute difference between the ground truth metric response and our prediction of that response. As these errors also come in arbitrarily different scales for different metrics, we normalize them per metric by dividing by the global 95th percentile of the GT metric response across the balanced training dataset.

We additionally report metric prediction errors for subsets of the test data to understand the false/true-positive and false/true-negative tendency. In All, we compute the error for the whole test dataset. Additionally, we consider two subsets of the test dataset. The first subset is Clean, which includes only natural patches. The second one is Distorted, which contains only IBR patches, including those that might also come out with very low or even with no error. Please note that this is a partitioning of the test set, and not of the training set.
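The error computation per partition can be sketched as below. The function name `partition_errors` is hypothetical, and it is an assumption that one scalar prediction and one scalar ground-truth response are available per test patch.

```python
import numpy as np

def partition_errors(pred, gt, is_natural, scale):
    """Mean normalized absolute error for the All, Clean and Distorted partitions.

    pred, gt: predicted and ground-truth metric response per test patch.
    is_natural: boolean flag per patch, True for natural (clean) patches.
    scale: global 95th percentile of the GT response on the training data.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    is_natural = np.asarray(is_natural, dtype=bool)
    err = np.abs(pred - gt) / scale
    return {
        "All": float(err.mean()),
        "Clean": float(err[is_natural].mean()),
        "Distorted": float(err[~is_natural].mean()),
    }
```

If the Clean and Distorted partitions are equally large, All is the mean of the other two, which is consistent with Tbl. 1, e. g., (.006 + .189)/2 ≈ .098 for Full on MSE.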

4.2 Quantitative results

In this section, we discuss both the means and full error distributions of all training strategies for different partitions and different metrics.

Means

The means of all methods are compared in Tbl. 1. We see that our method (Full) has the smallest error across different metrics compared to both other variants (bold in column All).

In detail, when we look into the partitioning, we find that for the Distorted partition, the NoNatural strategy performs best. This is expected as training is done with all distorted patches, which comprise the maximal variety of distortion. This makes the resulting metric sensitive to all kinds of distortions. As a result, the probability of false negatives, i. e., claiming patches with an error to be fine, becomes low.

We also find that for the Clean partition, the NoBalance strategy performs best. This also is expected as in the training, 50 % of the data comprises natural (undistorted) patches, and due to the NoBalance strategy, small errors dominate in the distorted patches. This makes the resulting metric particularly sensitive to near-threshold distortions. In this case, the probability of false positives, i. e., reporting a high metric response for no-error patches, is low.

All statements are true (significant, $p<.01$ , $t$ -test after testing for Gaussianity) across all metrics, indicating that the Full approach is independent of the underlying metric. A positive exception is VGG, where the Full approach even performs better than NoBalance on the Clean partition.

Distributions

In Fig. 4, we show the distribution of errors for different metric predictions (top) and the correlation of the prediction error and metric response (bottom). In each plot, colors encode the variants of our approach (NoNatural, NoBalance, Full).

Each plot in the first row of Fig. 4 shows the sorted error of our metric prediction in ascending order. We see that across the entire range, with the exception of MSE prediction for low errors, the Full approach performs better than the other variants. This indicates that the mean is a good characterization of the performance. In all cases, we noticed a sudden increase in the error that occurs around 50 % of the population, i. e., the error for the first half of the population seems to follow a different trend than the second half. We hypothesize that these are the patches where reference and input are (partially) not aligned, which make up roughly 50 % of the population as well. Unfortunately, there is no way to tell apart a misaligned patch that is judged by FR metrics as different with respect to a displaced reference. Hence, large errors are expected to become undetectable at some error level. The exception is the regime in MSE where the Full approach is worse on low errors and slightly better on high errors, while it performs best on average (Tbl. 1). This can be difficult to comprehend due to the log scale of the vertical axis.

Each plot in the second row in Fig. 4 shows the error of our prediction on the vertical axis and the metric response on the horizontal axis as a connected scatter plot. We can see that the plots are in accordance with Tbl. 1: The NoNatural method, which performs best in predicting high metric responses, has a high error on patches with small metric response (false positives). Symmetrically, the NoBalance method, which is the best at predicting low metric responses, produces high errors on patches with high metric response (false negatives). The Full method is always a bit worse than one other method in one region (except at the unique point where both cross), but on average performs best overall.

4.3 Qualitative results

Example metric outputs

Fig. 5 shows an analysis of the response of all metrics to two different LFs from the test set. The first column shows the distorted input $\mathcal{A}$ in the top, below the hidden reference $\mathcal{B}$ and below this three insets from both. The second column shows our predicted response $\mathcal{A}\ominus\mathcal{B}$ for different metrics: MSE on top, followed by SSIM and VGG. A false color coding, where cold colors indicate a low response and warm colors indicate a high response, is used. The third column shows the GT response for the same. It is evident that there is a similarity between our prediction and the ground truth. We slightly err towards conservative, i. e., miss a few errors. How some of these errors are only false findings, i. e., a limitation of the metrics, becomes apparent from the user study to follow.

The last column shows a sanity check where we put the hidden reference image $\mathcal{B}$ into our metric. The hidden reference obviously does not contain any error, and consequently reporting one is a false positive. We see that our metric responds in some areas that are correct but look like IBR artifacts, while in most areas it has no response. In summary, this indicates that we localize and scale errors to a hidden reference in images with artifacts, while avoiding producing a signal when facing clean images. It might appear that MSE has fewer false positives than SSIM or VGG when inspecting the last column; simply more deep blue, very close to perfect in the first row. However, such a trend is not supported by the numbers in Tbl. 1 or the plots in Fig. 4. The true reason for this impression might be that the SSIM and VGG responses simply have a larger receptive field per se: MSE is per-pixel while VGG is affected by up to $32\times 32$ pixels. Even the ground truth response is denser (less deep blue). Consequently the metric prediction, in case of error, also makes spatially more extended, denser mistakes.

Transformation-invariance

Surprisingly, results produced by our approach can turn out to be better than their own supervision, as our method is forced to come up with strategies to detect problems without seeing the reference. This makes it immune to a common issue of many image metrics: misalignment [26]. Even a simple shift in image content will result in many false positives for classic metrics (Fig. 6). An image that has merely been shifted is reported to be very different from a reference by all the metrics used for our supervision; however, it shows fewer differences when we add IBR artifacts to it. In contrast, our method does not care about the transformation, but when IBR artifacts are added, they are detected. As our proposed method is oblivious to the ground truth, it is not subject to such a misconception. While not quantifiable, the result is arguably more similar to human judgment, as indicated by the user experiment in the next subsection.

4.4 User study

We have conducted a user experiment to validate that our predicted metric responses spatially correlate with the visibility of artifacts to human subjects. We quantify the human responses by means of per-pixel annotations, which are painted on top of images showing IBR artifacts. Note that no user responses were used for training.

Methods

Naïve users were asked to use a binary painting interface to mark errors in a rendered image for each of the six LFs of our test dataset in an open-ended session that took 15 minutes on average. We average the binary response into a continuous fraction (percentage) of users that detected the location of the artifacts.

Analysis

Asking $N=10$ users, we find the correlation (Pearson linear correlation $R$, higher values are better; statements are highly significant as the correlation is computed on a high number of image pixels) reported in Fig. 7-b. We see that for many scenes, as well as for the average across scenes, our method has a higher correlation with user annotation than the metric it was supervised on. We hypothesize that this is because our network has learned to become independent of a reference, a robustness similar to the one the HVS employs. There is no clear trend on which of our metric response predictions correlates the most with the user annotations. The differences between scenes, however, seem more pronounced.
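
The analysis can be sketched as follows, assuming flattened per-pixel lists and toy data; the function names and the example values are hypothetical and only illustrate how binary annotations are averaged into a fraction and then correlated with a predicted response.

```python
import math

def annotation_fraction(binary_masks):
    """Average per-user binary masks (flattened pixel lists) into a per-pixel fraction."""
    n_users = len(binary_masks)
    return [sum(pixel) / n_users for pixel in zip(*binary_masks)]

def pearson_r(x, y):
    """Pearson linear correlation between two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy data: three hypothetical users marking a six-pixel image.
masks = [
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
]
user_fraction = annotation_fraction(masks)
# Hypothetical predicted metric response for the same six pixels.
predicted_response = [0.0, 0.2, 0.9, 0.6, 0.1, 0.0]

r = pearson_r(user_fraction, predicted_response)
print("user fraction:", user_fraction)
print("Pearson R:", r)
```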

When repeating the experiment with a non-aligned reference (shifted a mere 20 px to the right), we find the correlations reported in Fig. 7-c. We see that our correlation even improves in this condition (our metric shows higher correlations for all metrics across different scenes), showing that we are more robust to alignment issues when predicting user responses.

Perceptualization

Finally, we computed a linear correlation $R$ by fitting a model $x_{i}=a\cdot y_{i}+b$, where $x_{i}$ is the user response and $y_{i}$ is our prediction of the metric response for pixel $i$. This allows a “perceptualization” of our metric's response. Fitting multiple models $a,b$ in a leave-one-out protocol, each to 5 of our 6 scenes, produces an average error of $.05/.04/.02$ for MSE/SSIM/VGG respectively, indicating that this perceptualization generalizes to some extent.
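
A minimal sketch of such a leave-one-out fit is given below. It assumes an ordinary least-squares fit and uses the mean absolute error on the held-out scene as the error measure; the error measure, the function names, and the toy scenes are assumptions made for illustration only.

```python
def fit_line(y, x):
    """Least-squares fit of x ~ a * y + b; returns (a, b)."""
    n = len(y)
    mean_y = sum(y) / n
    mean_x = sum(x) / n
    var_y = sum((v - mean_y) ** 2 for v in y)
    cov = sum((v - mean_y) * (u - mean_x) for v, u in zip(y, x))
    a = cov / var_y
    return a, mean_x - a * mean_y

def leave_one_out_error(scenes):
    """scenes: list of (prediction y, user response x) pixel lists, one pair per scene.

    Fits (a, b) on all scenes but one and evaluates on the held-out scene.
    Mean absolute error is an assumed error measure.
    """
    errors = []
    for held_out in range(len(scenes)):
        train_y = [v for i, (y, _) in enumerate(scenes) if i != held_out for v in y]
        train_x = [u for i, (_, x) in enumerate(scenes) if i != held_out for u in x]
        a, b = fit_line(train_y, train_x)
        test_y, test_x = scenes[held_out]
        abs_err = [abs(a * v + b - u) for v, u in zip(test_y, test_x)]
        errors.append(sum(abs_err) / len(abs_err))
    return sum(errors) / len(errors)

# Toy scenes generated from x = 0.5 * y + 0.1, so the held-out error is close to zero.
scenes = [
    ([0.0, 0.4, 0.8], [0.10, 0.30, 0.50]),
    ([0.2, 0.6, 1.0], [0.20, 0.40, 0.60]),
    ([0.1, 0.5, 0.9], [0.15, 0.35, 0.55]),
]
print("average held-out error:", leave_one_out_error(scenes))
```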

4.5 Other architectures

We also explored using other architectures with or without balancing. A simple solution would be to use a supervised image translation network such as Pix2Pix [22] to map from entire IBR images to the metric response. Unfortunately, training these on our data converges to a flat response of zero, as artifacts are too rare and subtle to be picked up without the balancing we suggest. Future work could investigate combining our balancing with other architectures.

4.6 Supplementary materials

Ground-truth responses of all metrics and our predictions for all input images, for all variants of the algorithm, as well as all user study annotations can be explored in an interactive web application in the supplementary materials.

5 Applications

We will now demonstrate two practical applications of a NR-IQM in light field production. The first is accelerating automated adaptive LF capture (Sec. 5.1), the second employs our NR-IQM as a feedback in an interactive depth manipulation system (Sec. 5.2).

5.1 Adaptive light field capturing

Capturing a dense set of input view images results in a high-quality reconstruction but remains a time-consuming process or may require a bulky setup. Our main observation is that not all input view images contribute equally to the reconstruction of novel-view images. Our metric helps to identify and capture the ones that contribute most.

Images from views dominated by planar diffuse surfaces can reliably be predicted from images taken from other views showing this very same surface. Hence, dense capturing from these views is not needed and would be inefficient.

In contrast, occlusions and specularity can be more challenging, because it must be ensured that each scene element is visible in at least two camera views (when using multi-view stereo, as we do) to compute depth. Sparse capturing from these views would sacrifice the reconstruction quality.

To address both cases, we propose an adaptive capturing mechanism, illustrated in Fig. 8, that captures an image for a view only if it cannot be extrapolated from other views.

5.1.1 Setup

We study adaptive capturing by means of a large-scale translation stage equipped with a digital camera. The position of the camera can be controlled with a precision of $80\,\mu m$ in horizontal and $50\,\mu m$ in vertical direction. This allows for very dense capturing of the scene. While this takes long to capture, it serves as a unique baseline to our study where we can compare our prediction of an error to the actual error present.

5.1.2 Procedure

We first capture a sparse set of images and estimate the depth maps for each view. Then, we use DIBR to render a set of intermediate views and compute the predicted reconstruction error for each rendered view. All pixels are simply averaged in each view image, producing a single scalar value. The capturing grid is then subdivided into smaller regions wherever the average predicted reconstruction error is larger than a given threshold. This process is repeated until a desired quality is achieved. With this approach, the number of captured views can be substantially reduced, and we only need to capture images at locations where reconstruction is poor.
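
The refinement loop can be sketched as follows. This is a simplified one-dimensional version under stated assumptions, not our implementation: `predict_error` is a hypothetical callable that stands in for rendering the view between two captured positions with DIBR, running the no-reference metric on it, and averaging the response over all pixels; `toy_error` is an invented error model used only to make the example runnable.

```python
def adaptive_capture(positions, predict_error, threshold, min_spacing=1):
    """Refine a list of 1D camera positions until no interval exceeds the threshold.

    predict_error(left, right) is a hypothetical stand-in for the averaged
    no-reference metric response of the view rendered between two captures.
    """
    captured = sorted(positions)
    changed = True
    while changed:
        changed = False
        refined = [captured[0]]
        for left, right in zip(captured, captured[1:]):
            if right - left > min_spacing and predict_error(left, right) > threshold:
                refined.append((left + right) // 2)  # capture a new view here
                changed = True
            refined.append(right)
        captured = refined
    return captured

def toy_error(left, right):
    """Assumed error model: views around position 40 are hard to reconstruct,
    and the error grows with the gap between neighbouring captured views."""
    mid = (left + right) / 2
    difficulty = 1.0 if 30 <= mid <= 50 else 0.1
    return difficulty * (right - left) / 64

views = adaptive_capture([0, 64, 128], toy_error, threshold=0.06)
print(views)
```

In this toy setting, the loop places most new views in the difficult region and leaves the easy regions sparse, which mirrors the behaviour described above without requiring any reference image.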

Predicting the reconstruction error of novel views is the key to making such an approach work. Classic full-reference image quality metrics require a dense capture to provide reference images to compute the error, which is not practical as our goal is to reduce the number of captured images in the first place. In contrast, our proposed no-reference metric can measure the error in the novel-view images without being provided with their reference images, resulting in an efficient approach.

5.1.3 Evaluation

To evaluate the effectiveness of our metric in this application, we simulate capturing two LFs, adapted according to the MSE metric.

Array

We captured an array of $7\times 15$ images for the scene shown in Fig. 8 (left). In Fig. 9 we show the ground truth MSE (left) and our network prediction (right), where each grid element denotes a camera position. The dark blue grid elements indicate the camera positions where actual key frames were captured, while rendering has been performed for all remaining intermediate positions.

As we can see, the distribution of reconstruction error as predicted by our metric correlates well with the ground truth. Fig. 8 (right) shows new camera locations that are required to reduce the true average reconstruction error below .004.

Panoramic

We also demonstrate the potential benefit of our approach for efficient panoramic (i. e., one-dimensional, linear) light field capturing. As shown in Fig. 10, depending on the scene content, not all regions in the scene require equally dense camera placement. Our metric successfully guides the capturing setup to take more photos in the regions with thin structures, substantial disocclusions or specularities, where accurate reconstruction is highly challenging. Overall, capturing 76 instead of 720 images (a sparsity of 10.5 %) reduces the total capture time from 59 minutes to 4.9 minutes, i. e., by 91 %.

5.2 Interactive depth adjustments

The long acquisition times involved in capturing dense light fields make it a tedious and impractical task for some application fields. One such field is movie production, where the presence of highly dynamic scenes and time pressure discourage the use of dense light fields; in such cases, only sparse light field capture using video camera arrays is seen as a convenient solution.

Unfortunately, automatic error-free light field reconstruction from a sparse capture is still an unsolved problem. To this end, there are ongoing research efforts to address challenges such as the estimation of disparity in the presence of homogeneous areas, repetitive structures, fine-grained objects, or specularities. In such cases, interactive disparity estimation improvement seems to be the most promising solution to achieve a high-quality view rendering [58, 25, 30, 9]. However, this requires detecting possible view rendering artifacts as fast as possible to reduce the post-processing time. As shown in the right-most image of the second row in Fig. 10, spotting an artifact is not a trivial task and sometimes requires carefully scanning the view rendering result. Our quality estimation metric can significantly simplify this process by allowing the automatic analysis of several novel rendered views. By observing the predicted visibility map, which identifies the local distortions, the user can quickly spot the problematic regions. Using a post-production software suite (https://www.iis.fraunhofer.de/realception) to perform an interactive view rendering with only a small subset of cameras allows detecting the captured view responsible for the error. The inspection of the corresponding disparity map followed by an approach similar to [58, 25] finally allows fixing the view rendering error. This is achieved by manual creation of a geometry proxy in 3D space for objects whose disparity map could not be computed automatically. The proxy is then used to bound the admissible depth values for a subsequent disparity estimation.

The results of this procedure are illustrated in Fig. 11. The contained repetitive structures are very challenging for automatic disparity estimation and consequently lead to many view rendering artifacts as clearly indicated by the depicted error map. For solving these issues, a user has added proxy-based disparity constraints for the waste basket (and the contained figurine), the grid structure behind the flower, and the grid structure in the upper right corner of the image. By these means, a much better view rendering could be achieved as shown in Fig. 11. Our metric has reduced the time required to find those reconstruction errors, leaving more time to a user to correct them.

6 Conclusion

We have demonstrated that with properly adjusted training data (prioritization and natural supervision), a CNN can learn how to predict the difference of an image to a hidden reference. Our approach is independent of the metric used and we have shown MSE, SSIM and VGG prediction. Other metrics such as HDR-VDP-2 [35] or the CNN-based metric of Wolski et al. [60] would likely be predictable in a similar fashion.

Such a metric can be used in several applications. As demonstrated, these include adaptive light field sampling of complex scenes and interactive depth editing. Moreover, since our approach, in contrast to any existing no-reference metric, provides a predicted error map, it opens the potential for many novel applications such as interactive or automatic view rendering error correction.

In future work, we would like to overcome the limitations of the paired input, eventually using an adversarial [15] design, and learn the prediction only from pairs and without the metric, or only from pairs of undistorted-metric or distorted-metric.

Acknowledgements

This work was partly supported by the Fraunhofer-Max Planck cooperation program within the framework of the German pact for research and innovation (PFI) and a Google AR/VR Research Award.

Bibliography

[1] http://lightfield.stanford.edu/lfs.html.
[2] S. A. Amirshahi, M. Pedersen, and S. X. Yu. Image quality assessment by comparing CNN features between images. J Imag. Sci. and Technology, 60(6):60410–1, 2016.
[3] F. Battisti, E. Bosc, M. Carli, P. Le Callet, and S. Perugia. Objective image quality assessment of 3D synthesized views. Image Commun., 30(C):78–88, 2015.
[4] K. Berger, C. Lipski, C. Linz, A. Sellent, and M. Magnor. A ghosting artifact detector for interpolated image quality assessment. In IEEE Int. Symp. on Consumer Electronics, pages 1–6, 2010.
[5] S. Bianco, L. Celona, P. Napoletano, and R. Schettini. On the use of deep learning for blind image quality assessment. arXiv:1602.05531, 2016.
[6] E. Bosc, R. Pepion, P. L. Callet, M. Koppel, P. Ndjiki-Nya, M. Pressigout, and L. Morin. Towards a new quality metric for 3-d synthesized view assessment. IEEE J of Selected Topics in Signal Processing, 5(7):1332–1343, 2011.
[7] S. Bosse, D. Maniry, K. R. Müller, T. Wiegand, and W. Samek. Deep neural networks for no-reference and full-reference image quality assessment. IEEE TIP, 27(1):206–219, 2018.
[8] M. Čadík, R. Herzog, R. Mantiuk, R. Mantiuk, K. Myszkowski, and H.-P. Seidel. Learning to predict localized distortions in rendered images. In Comp. Graph. Forum, volume 32, pages 401–10, 2013.