TL;DR
This paper introduces a probabilistically principled framework for pluralistic image completion, generating multiple diverse plausible solutions by combining reconstructive and generative paths supported by GANs and a novel attention layer.
Contribution
It proposes a novel dual-path framework with attention mechanisms to produce diverse image completions, overcoming the limitations of single-solution methods.
Findings
Generated higher-quality, diverse completions on multiple datasets.
Outperformed existing methods in diversity and quality of results.
Introduced a new attention layer for better appearance consistency.
| Method | Diversity (LPIPS), full output | Diversity (LPIPS), mask-region output |
|---|---|---|
| CVAE | 0.004 | 0.014 |
| Instance Blind | 0.015 | 0.049 |
| BicycleGAN [46] | 0.027 | 0.060 |
| PICNet-Pluralistic | 0.029 | 0.088 |
| (b) Encoder (Representation) |
|---|
| RGB image |
| ResBlock start |
| ResBlock down |
| ResBlock down |
| ResBlock down |
| ResBlock down |
Pluralistic Image Completion
Chuanxia Zheng Tat-Jen Cham Jianfei Cai
School of Computer Science and Engineering
Nanyang Technological University, Singapore
{chuanxia001,astjcham,asjfcai}@ntu.edu.sg
Abstract
Most image completion methods produce only one result for each masked input, although there may be many reasonable possibilities. In this paper, we present an approach for pluralistic image completion: the task of generating multiple and diverse plausible solutions for image completion. A major challenge faced by learning-based approaches is that there is usually only one ground truth training instance per label. As such, sampling from conditional VAEs still leads to minimal diversity. To overcome this, we propose a novel and probabilistically principled framework with two parallel paths. One is a reconstructive path that utilizes the single given ground truth to get the prior distribution of missing parts and rebuild the original image from this distribution. The other is a generative path for which the conditional prior is coupled to the distribution obtained in the reconstructive path. Both are supported by GANs. We also introduce a new short+long term attention layer that exploits distant relations among decoder and encoder features, improving appearance consistency. When tested on datasets with buildings (Paris), faces (CelebA-HQ), and natural images (ImageNet), our method not only generated higher-quality completion results, but also produced multiple and diverse plausible outputs.
1 Introduction
Image completion is a highly subjective process. Supposing you were shown the various images with missing regions in fig. 1, what would you imagine to be occupying these holes? Bertalmio et al. [4] related how expert conservators would inpaint damaged art by: 1) imagining the semantic content to be filled based on the overall scene; 2) ensuring structural continuity between the masked and unmasked regions; and 3) filling in visually realistic content for missing regions. Nonetheless, each expert will independently end up creating substantially different details, even if they may universally agree on high-level semantics, such as general placement of eyes on a damaged portrait.
Based on this observation, our main goal is thus to generate multiple and diverse plausible results when presented with a masked image — in this paper we refer to this task as pluralistic image completion (depicted in fig. 1). This is as opposed to approaches that attempt to generate only a single “guess” for missing parts.
Early image completion works [4, 7, 5, 8, 3, 13] focus only on steps 2 and 3 above, by assuming that gaps should be filled with similar content to that of the background. Although these approaches produced high-quality texture-consistent images, they can neither capture global semantics nor hallucinate new content for large holes. More recently, some learning-based image completion methods [29, 14, 39, 40, 42, 24, 38] were proposed that infer semantic content (as in step 1). These works treated completion as a conditional generation problem, where the input-to-output mapping is one-to-many. However, these prior works are limited to generating only one “optimal” result, and do not have the capacity to generate a variety of semantically meaningful results.
To obtain a diverse set of results, some methods utilize conditional variational auto-encoders (CVAE) [34, 37, 2, 10], a conditional extension of VAE [19], which explicitly code a distribution that can be sampled. However, specifically for an image completion scenario, the standard single-path formulation usually leads to grossly underestimating variances. This is because when the condition label is itself a partial image, the number of instances in the training data that match each label is typically only one. Hence the estimated conditional distributions tend to have very limited variation since they were trained to reconstruct the single ground truth. This is further elaborated on in section 3.1.
An important insight we will use is that partial images, as a superset of full images, may also be considered as generated from a latent space with smooth prior distributions. This provides a mechanism for alleviating the problem of having scarce samples per conditional partial image. To do so, we introduce a new image completion network with two parallel but linked training pipelines. The first pipeline is a VAE-based reconstructive path that not only utilizes the full instance ground truth (i.e. both the visible partial image and its complement, the hidden partial image), but also imposes smooth priors for the latent space of complement regions. The second pipeline is a generative path that predicts the latent prior distribution for the missing regions conditioned on the visible pixels, which can be sampled to generate diverse results. The training process for the latter path does not attempt to steer the output towards reconstructing the instance-specific hidden pixels at all, instead allowing the reasonableness of results to be driven by an auxiliary discriminator network [11]. This leads to substantially greater variability in content generation. We also introduce an enhanced short+long term attention layer that significantly increases the quality of our results.
We compared our method with existing state-of-the-art approaches on multiple datasets. Not only can higher-quality completion results be generated using our approach, it also presents multiple diverse solutions.
The main contributions of this work are:
1. A probabilistically principled framework for image completion that is able to maintain much higher sample diversity as compared to existing methods;
2. A new network structure with two parallel training paths, which trades off between reconstructing the original training data (with loss of diversity) and maintaining the variance of the conditional distribution;
3. A novel self-attention layer that exploits short+long term context information to ensure appearance consistency in the image domain, in a manner superior to purely using GANs; and
4. We demonstrate that our method is able to complete the same mask with multiple plausible results that have substantial diversity, such as those shown in figure 1.
2 Related Work
Existing work on image completion either uses information from within the input image [4, 5, 3], or information from a large image dataset [12, 29, 42]. Most approaches will generate only one result per masked image.
Intra-Image Completion Traditional intra-image completion, such as diffusion-based methods [4, 1, 22] and patch-based methods [5, 7, 8, 3], assume image holes share similar content to visible regions; thus they would directly match, copy and realign the background patches to complete the holes. These methods perform well for background completion, e.g. for object removal, but cannot hallucinate unique content not present in the input images.
Inter-Image Completion To generate semantically new content, inter-image completion borrows information from a large dataset. Hays and Efros [12] presented an image completion method using millions of images, in which the image most similar to the masked input is retrieved, and corresponding regions are transferred. However, this requires a high contextual match, which is not always available. Recently, learning-based approaches were proposed. Initial works [20, 30] focused on small and thin holes. Context encoders (CE) [29] handled 64×64-sized holes using GANs [11]. This was followed by several CNN-based methods, which included combining global and local discriminators as adversarial loss [14], identifying closest features in the latent space of masked images [40], utilizing semantic labels to guide the completion network [36], introducing additional face parsing loss for face completion [23], and designing particular convolutions to address irregular holes [24, 41]. A common drawback of these methods is that they often create distorted structures and blurry textures inconsistent with the visible regions, especially for large holes.
Combined Intra- and Inter-Image Completion To overcome the above problems, Yang et al. [39] proposed multi-scale neural patch synthesis, which generates high-frequency details by copying patches from mid-layer features. However, this optimization is computationally costly. More recently, several works [42, 38, 35] exploited spatial attention [16, 46] to get high-frequency details. Yu et al. [42] presented a contextual attention layer to copy similar features from visible regions to the holes. Yan et al. [38] and Song et al. [35] proposed PatchMatch-like ideas in the feature domain. However, these methods identify similar features by comparing features of holes and features of visible regions, which is somewhat contradictory: feature transfer is unnecessary when two features are very similar, but when it is needed the features are too different to be matched easily. Furthermore, distant information is not used for new content that differs from visible regions. Our model addresses this problem by extending self-attention [43] to harness abundant context.
Image Generation Image generation has progressed significantly using methods such as VAE [19] and GANs [11]. These have been applied to conditional image generation tasks, such as image translation [15], synthetic to realistic [44], future prediction [27], and 3D models [28]. Perhaps most relevant are conditional VAEs (CVAE) [34, 37] and CVAE-GAN [2], but these were not specially targeted for image completion. CVAE-based methods are most useful when the conditional labels are few and discrete, and there are sufficient training instances per label. Some recent work utilizing these in image translation can produce diverse output [47, 21], but in such situations the condition-to-sample mappings are more local (e.g. pixel-to-pixel), and only change the visual appearance. This is untrue for image completion, where the conditional label is itself the masked image, with only one training instance of the original holes. In [6], different outputs were obtained for face completion by specifying facial attributes (e.g. smile), but this method is very domain specific, requiring targeted attributes.
3 Approach
Suppose we have an image, originally $\mathbf{I}_g$, but degraded by a number of missing pixels to become $\mathbf{I}_m$ (the masked partial image) comprising the observed / visible pixels. We also define $\mathbf{I}_c$ as its complement partial image comprising the ground truth hidden pixels. Classical image completion methods attempt to reconstruct the ground truth unmasked image $\mathbf{I}_g$ in a deterministic fashion from $\mathbf{I}_m$ (see fig. 2 “Deterministic”). This results in only a single solution. In contrast, our goal is to sample from $p(\mathbf{I}_c|\mathbf{I}_m)$.
3.1 Probabilistic Framework
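In order to have a distribution to sample from, a current approach is to employ the CVAE [34], which estimates a parametric distribution over a latent space, from which sampling is possible (see fig. 2 “CVAE”). This involves a variational lower bound of the conditional log-likelihood of observing the training instances:

$$\log p(\mathbf{I}_c|\mathbf{I}_m) \geq -\mathrm{KL}\big(q_{\psi}(\mathbf{z}_c|\mathbf{I}_c,\mathbf{I}_m)\,\|\,p_{\phi}(\mathbf{z}_c|\mathbf{I}_m)\big) + \mathbb{E}_{q_{\psi}(\mathbf{z}_c|\mathbf{I}_c,\mathbf{I}_m)}\big[\log p_{\theta}(\mathbf{I}_c|\mathbf{z}_c,\mathbf{I}_m)\big] \tag{1}$$

where $\mathbf{z}_c$ is the latent vector, $q_{\psi}(\cdot|\cdot)$ the posterior importance sampling function, $p_{\phi}(\cdot|\cdot)$ the conditional prior, $p_{\theta}(\cdot|\cdot)$ the likelihood, with $\psi$, $\phi$ and $\theta$ being the deep network parameters of their corresponding functions. This lower bound is maximized w.r.t. all parameters.

For our purposes, the chief difficulty of using CVAE [34] directly is that the high DoF networks of $q_{\psi}(\cdot|\cdot)$ and $p_{\phi}(\cdot|\cdot)$ are not easily separable in (1), with the KL distance easily driven towards zero. Maximizing (1) is then approximately equivalent to maximizing $\mathbb{E}_{p_{\phi}(\mathbf{z}_c|\mathbf{I}_m)}[\log p_{\theta}(\mathbf{I}_c|\mathbf{z}_c,\mathbf{I}_m)]$ (the “GSNN” variant in [34]). This consequently learns a delta-like prior of $p_{\phi}(\mathbf{z}_c|\mathbf{I}_m)\rightarrow\delta(\mathbf{z}_c-\mathbf{z}_c^{*})$, where $\mathbf{z}_c^{*}$ is the maximum latent likelihood point of $p_{\theta}(\mathbf{I}_c|\cdot,\mathbf{I}_m)$. While this low variance prior may be useful in estimating a single solution, sampling from it will lead to negligible diversity in image completion results (as seen in fig. 9). When the CVAE variant of [37], which has a fixed latent prior, is used instead, the network learns to ignore the latent sampling and directly estimates $\mathbf{I}_c$ from $\mathbf{I}_m$, also resulting in a single solution. This is because, in the image completion scenario, there is only one training instance per condition label, which is a partial image $\mathbf{I}_m$. Details are in the supplemental section B.1.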
A possible way to diversify the output is to simply not incentivize the output to reconstruct the instance-specific $\mathbf{I}_c$ during training, only needing it to fit in with the training set distribution as deemed by a learned adversarial discriminator (see fig. 2 “Instance Blind”). However, this approach is unstable, especially for large and complex scenes [35].
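Latent Priors of Holes
In our approach, we require that missing partial images, as a superset of full images, also arise from a latent space distribution, with a smooth prior of $p(\mathbf{z}_c)$. The variational lower bound is:

$$\log p(\mathbf{I}_c) \geq -\mathrm{KL}\big(q_{\psi}(\mathbf{z}_c|\mathbf{I}_c)\,\|\,p(\mathbf{z}_c)\big) + \mathbb{E}_{q_{\psi}(\mathbf{z}_c|\mathbf{I}_c)}\big[\log p_{\theta}(\mathbf{I}_c|\mathbf{z}_c)\big] \tag{2}$$

where in [19] the prior is set as $p(\mathbf{z}_c)=\mathcal{N}(\mathbf{0},\mathbf{I})$. However, we can be more discerning when it comes to partial images since they have different numbers of pixels. A missing partial image with more pixels (larger holes) should have greater latent prior variance than a missing partial image with fewer pixels (smaller holes). Hence we generalize the prior to $p(\mathbf{z}_c)=\mathcal{N}(\mathbf{0},\sigma^{2}(n)\mathbf{I})$, which adapts to the number of pixels $n$.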
Prior-Conditional Coupling
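Next, we combine the latent priors into the conditional lower bound of (1). This can be done by assuming $\mathbf{z}_c$ is much more closely related to $\mathbf{I}_c$ than to $\mathbf{I}_m$, so $q_{\psi}(\mathbf{z}_{c}|\mathbf{I}_{c},\mathbf{I}_{m}) \approx q_{\psi}(\mathbf{z}_{c}|\mathbf{I}_{c})$. Updating (1):

$$\log p(\mathbf{I}_c|\mathbf{I}_m) \geq -\mathrm{KL}\big(q_{\psi}(\mathbf{z}_c|\mathbf{I}_c)\,\|\,p_{\phi}(\mathbf{z}_c|\mathbf{I}_m)\big) + \mathbb{E}_{q_{\psi}(\mathbf{z}_c|\mathbf{I}_c)}\big[\log p_{\theta}(\mathbf{I}_c|\mathbf{z}_c,\mathbf{I}_m)\big] \tag{3}$$

However, unlike in (1), notice that $q_{\psi}(\mathbf{z}_c|\mathbf{I}_c)$ is no longer freely learned during training, but is tied to its presence in (2). Intuitively, the learning of $q_{\psi}(\mathbf{z}_c|\mathbf{I}_c)$ is regularized by the prior $p(\mathbf{z}_c)$ in (2), while the learning of the conditional prior $p_{\phi}(\mathbf{z}_c|\mathbf{I}_m)$ is in turn regularized by $q_{\psi}(\mathbf{z}_c|\mathbf{I}_c)$ in (3).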
Reconstruction vs Creative Generation
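One issue with (3) is that the sampling is taken from $q_{\psi}(\mathbf{z}_c|\mathbf{I}_c)$ during training, but $q_{\psi}(\mathbf{z}_c|\mathbf{I}_c)$ is not available during testing, whereupon sampling must come from $p_{\phi}(\mathbf{z}_c|\mathbf{I}_m)$, which may not be adequately learned for this role. In order to mitigate this problem, we modify (3) to have a blend of formulations with and without importance sampling. So, with simplified notation:

$$\log p(\mathbf{I}_c|\mathbf{I}_m) \geq \lambda\Big\{\mathbb{E}_{q_{\psi}}\big[\log p^{r}_{\theta}(\mathbf{I}_c|\mathbf{z}_c,\mathbf{I}_m)\big] - \mathrm{KL}\big(q_{\psi}\,\|\,p_{\phi}\big)\Big\} + (1-\lambda)\,\mathbb{E}_{p_{\phi}}\big[\log p^{g}_{\theta}(\mathbf{I}_c|\mathbf{z}_c,\mathbf{I}_m)\big] \tag{4}$$

where $0\leq\lambda\leq 1$ is implicitly set by training loss coefficients in section 3.3. When sampling from the importance function $q_{\psi}(\cdot|\mathbf{I}_c)$, the full training instance is available and we formulate the likelihood $p^{r}_{\theta}(\mathbf{I}_c|\mathbf{z}_c,\mathbf{I}_m)$ to be focused on reconstructing $\mathbf{I}_c$. Conversely, when sampling from the learned conditional prior $p_{\phi}(\cdot|\mathbf{I}_m)$, which does not contain $\mathbf{I}_c$, we facilitate creative generation by having the likelihood model $p^{g}_{\theta}(\mathbf{I}_c|\mathbf{z}_c,\mathbf{I}_m)$ be independent of the original instance of $\mathbf{I}_c$. Instead it only encourages generated samples to fit in with the overall training distribution.

Our overall training objective may then be expressed as jointly maximizing the lower bounds in (2) and (4), with the likelihood in (2) unified to that in (4) as $p_{\theta}(\mathbf{I}_c|\mathbf{z}_c)\cong p^{r}_{\theta}(\mathbf{I}_c|\mathbf{z}_c,\mathbf{I}_m)$. See the supplemental section B.2.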
3.2 Dual Pipeline Network Structure
This formulation is implemented as our dual pipeline framework, shown in fig. 3. It consists of two paths: the upper reconstructive path uses information from the whole image, i.e. $\mathbf{I}_g$=$\{\mathbf{I}_c,\mathbf{I}_m\}$, while the lower generative path only uses information from visible regions $\mathbf{I}_m$. Both representation and generation networks share identical weights. Specifically:
- For the upper reconstructive path, the complement partial image $\mathbf{I}_c$ is used to infer the importance function $q_{\psi}(\cdot|\mathbf{I}_c)$=$\mathcal{N}_{\psi}(\cdot)$ during training. The sampled latent vector $\mathbf{z}_c$ thus contains information of the missing regions, while the conditional feature $\mathbf{f}_m$ encodes the information of the visible regions. Since there is sufficient information, the loss function in this path is geared towards reconstructing the original image $\mathbf{I}_g$.
- For the lower generative path, which is also the test path, the latent distribution of the holes $\mathbf{I}_c$ is inferred based only on the visible $\mathbf{I}_m$. This would be significantly less accurate than the inference in the upper path. Thus the reconstruction loss is only targeted at the visible regions $\mathbf{I}_m$ (via $\mathbf{f}_m$).
- In addition, we also utilize adversarial learning networks on both paths, which ideally ensure that the full synthesized data fit in with the training set distribution, and empirically lead to higher quality images.
3.3 Training Loss
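Various terms in (2) and (4) may be more conventionally expressed as loss functions. Jointly maximizing the lower bounds is then minimizing a total loss $\mathcal{L}$, which consists of three groups of component losses:

$$\mathcal{L} = \alpha_{KL}\big(\mathcal{L}^{r}_{KL}+\mathcal{L}^{g}_{KL}\big) + \alpha_{app}\big(\mathcal{L}^{r}_{app}+\mathcal{L}^{g}_{app}\big) + \alpha_{ad}\big(\mathcal{L}^{r}_{ad}+\mathcal{L}^{g}_{ad}\big) \tag{5}$$

where the $\mathcal{L}_{KL}$ group regularizes consistency between pairs of distributions in terms of KL divergences, the $\mathcal{L}_{app}$ group encourages appearance matching fidelity, and the $\mathcal{L}_{ad}$ group forces sampled images to fit in with the training set distribution. Each of the groups has a separate term for the reconstructive and generative paths.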
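As a minimal sketch of how the three loss groups combine, the function below forms a weighted sum with one weight per group and one reconstructive and one generative term inside each group. The function name, argument names, and default weights are illustrative assumptions and are not taken from the authors' code.

```python
def total_loss(kl_r, kl_g, app_r, app_g, ad_r, ad_g,
               alpha_kl=1.0, alpha_app=1.0, alpha_ad=1.0):
    """Weighted sum of three loss groups (KL, appearance, adversarial).

    Each group has a reconstructive-path term (_r) and a generative-path
    term (_g). The default weights are placeholders, not the paper's values.
    """
    return (alpha_kl * (kl_r + kl_g)
            + alpha_app * (app_r + app_g)
            + alpha_ad * (ad_r + ad_g))


# Example with made-up per-term values and weights.
example = total_loss(1.0, 2.0, 3.0, 4.0, 5.0, 6.0,
                     alpha_kl=2.0, alpha_app=3.0, alpha_ad=1.0)
```

Sharing one weight per group keeps the two paths on the same scale within a group, which is how the grouping is described in the text.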
Distributive Regularization
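The typical interpretation of the KL divergence term in a VAE is that it regularizes the learned importance sampling function $q_{\psi}(\cdot|\mathbf{I}_c)$ to a fixed latent prior $p(\mathbf{z}_c)$. Defining both as Gaussians, we get:

$$\mathcal{L}^{r,(i)}_{KL} = \mathrm{KL}\big(q_{\psi}(\mathbf{z}|\mathbf{I}_c^{(i)})\,\|\,\mathcal{N}(\mathbf{0},\sigma^{2}(n)\mathbf{I})\big) \tag{6}$$

For the generative path, the appropriate interpretation is reversed: the learned conditional prior $p_{\phi}(\cdot|\mathbf{I}_m)$, also a Gaussian, is regularized to $q_{\psi}(\cdot|\mathbf{I}_c)$.

$$\mathcal{L}^{g,(i)}_{KL} = \mathrm{KL}\big(q_{\psi}(\mathbf{z}|\mathbf{I}_c^{(i)})\,\|\,p_{\phi}(\mathbf{z}|\mathbf{I}_m^{(i)})\big) \tag{7}$$

Note that the conditional prior only uses $\mathbf{I}_m$, while the importance function has access to the hidden $\mathbf{I}_c$.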
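Both regularization terms are KL divergences between Gaussians, which have a closed form when the covariances are diagonal. The sketch below computes that closed form in plain Python. The diagonal-covariance assumption and the `prior_variance` function are illustrative: the text only states that the prior variance should grow with the number of hidden pixels, so the linear form used here is hypothetical.

```python
import math


def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(q || p) for two diagonal Gaussians given per-dimension mean/variance."""
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        total += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return total


def prior_variance(num_hidden_pixels, scale=1e-4):
    """Hypothetical sigma^2(n): larger holes get a wider prior.

    The functional form is an assumption made for illustration only.
    """
    return 1.0 + scale * num_hidden_pixels


# Made-up outputs of the importance function q and the conditional prior p.
mu_q, var_q = [0.5, -0.2], [0.8, 1.1]
mu_p, var_p = [0.3, 0.0], [1.0, 1.0]
dim = len(mu_q)

# Reconstructive path: pull q towards the zero-mean adaptive prior.
kl_r = kl_diag_gaussians(mu_q, var_q, [0.0] * dim, [prior_variance(4096)] * dim)
# Generative path: tie the conditional prior to q.
kl_g = kl_diag_gaussians(mu_q, var_q, mu_p, var_p)
```

In a training loop, the means and variances would come from the network heads for the importance function and the conditional prior rather than from constants.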
Appearance Matching Loss
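The likelihood term $p^{r}_{\theta}(\mathbf{I}_c|\mathbf{z}_c)$ may be interpreted as probabilistically encouraging appearance matching to the hidden $\mathbf{I}_c$. However, our framework also auto-encodes the visible $\mathbf{I}_m$ deterministically, and the loss function needs to cater for this reconstruction. As such, the per-instance loss here is:

$$\mathcal{L}^{r,(i)}_{app} = \big\|\mathbf{I}_{rec}^{(i)}-\mathbf{I}_{g}^{(i)}\big\|_{1} \tag{8}$$

where $\mathbf{I}_{rec}^{(i)}$=$G(\mathbf{z},\mathbf{f}_m)$ and $\mathbf{I}_{g}^{(i)}$ are the reconstructed and original full images respectively. In contrast, for the generative path we ignore instance-specific appearance matching for $\mathbf{I}_c$, and only focus on reconstructing $\mathbf{I}_m$ (via $\mathbf{f}_m$):

$$\mathcal{L}^{g,(i)}_{app} = \big\|\mathbf{M}\ast(\mathbf{I}_{gen}^{(i)}-\mathbf{I}_{g}^{(i)})\big\|_{1} \tag{9}$$

where $\mathbf{I}_{gen}^{(i)}$=$G(\tilde{\mathbf{z}},\mathbf{f}_m)$ is the generated image from the $\tilde{\mathbf{z}}$ sample, and $\mathbf{M}$ is the binary mask selecting visible pixels.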
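The difference between the two appearance terms is only where the pixel error is counted: everywhere for the reconstructive path, and on visible pixels only for the generative path. A minimal sketch on flattened pixel lists follows; the helper names are assumptions, and an absolute-difference (L1) penalty without normalization is assumed.

```python
def l1_full(pred, target):
    """Sum of absolute differences over every pixel (reconstructive path)."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def l1_visible(pred, target, mask):
    """Sum of absolute differences over visible pixels only (generative path).

    mask holds 1 for a visible pixel and 0 for a hole, so errors inside the
    hole never contribute and the generator is free to vary there.
    """
    return sum(m * abs(p - t) for p, t, m in zip(pred, target, mask))


pred = [0.2, 0.9, 0.4, 0.0]
target = [0.0, 1.0, 0.0, 0.5]
mask = [1, 1, 0, 0]  # last two pixels are hidden

full_err = l1_full(pred, target)
visible_err = l1_visible(pred, target, mask)
```

Because the masked version ignores the hole, two generated images that differ only inside the hole receive the same appearance loss.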
Adversarial Loss
The formulation of $p^{r}_{\theta}$ and the instance-blind $p^{g}_{\theta}$ also incorporates the use of adversarially learned discriminators $D_1$ and $D_2$ to judge whether the generated images fit into the training set distribution. Inspired by [2], we use a mean feature match loss in the reconstructive path for the generator,

$$\mathcal{L}^{r,(i)}_{ad} = \big\|f_{D_1}(\mathbf{I}_{rec}^{(i)})-f_{D_1}(\mathbf{I}_{g}^{(i)})\big\|_{2} \tag{10}$$

where $f_{D_1}(\cdot)$ is the feature output of the final layer of $D_1$. This encourages the original and reconstructed features in the discriminator to be close together. Conversely, the adversarial loss in the generative path for the generator is:

$$\mathcal{L}^{g,(i)}_{ad} = \big[D_2(\mathbf{I}_{gen}^{(i)})-1\big]^{2} \tag{11}$$

This is based on the generator loss in LSGAN [26], which performs better than the original GAN loss [11] in our scenario. The discriminator loss for both $D_1$ and $D_2$ is also based on LSGAN.
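The two adversarial terms can be sketched with plain lists of discriminator outputs. The code below assumes the common LSGAN target coding (1 for real, 0 for fake) and a 0.5 factor on the discriminator loss; both are conventions assumed here rather than details stated in the text, and all function names are illustrative.

```python
import math


def lsgan_generator_loss(d_fake):
    """Least-squares generator loss: push D's scores on generated images to 1."""
    return sum((s - 1.0) ** 2 for s in d_fake) / len(d_fake)


def lsgan_discriminator_loss(d_real, d_fake):
    """Least-squares discriminator loss with targets 1 (real) and 0 (fake)."""
    real_term = sum((s - 1.0) ** 2 for s in d_real) / len(d_real)
    fake_term = sum(s ** 2 for s in d_fake) / len(d_fake)
    return 0.5 * (real_term + fake_term)


def feature_match_loss(feat_rec, feat_real):
    """L2 distance between discriminator features of reconstructed and real images."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(feat_rec, feat_real)))


g_loss = lsgan_generator_loss([0.8, 0.6])
d_loss = lsgan_discriminator_loss([0.9, 1.0], [0.2, 0.1])
fm_loss = feature_match_loss([3.0, 0.0], [0.0, 4.0])
```

Feature matching compares intermediate discriminator activations rather than the final real/fake score, so it gives the reconstructive path a target tied to the specific ground truth image.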
3.4 Short+Long Term Attention
Extending beyond the Self-Attention GAN [43], we propose not only to use the self-attention map within a decoder layer to harness distant spatial context, but also to further capture feature-feature context between encoder and decoder layers. Our key novel insight is: doing so would allow the network a choice of attending to the finer-grained features in the encoder or the more semantically generative features in the decoder, depending on circumstances.

Our proposed structure is shown in fig. 4. We first calculate the self-attention map from the features $\mathbf{f}_d$ of a decoder middle layer, using the attention score of:

$$\beta_{j,i} = \frac{\exp(s_{ij})}{\sum_{i=1}^{N}\exp(s_{ij})}, \quad \text{where } s_{ij} = Q(\mathbf{f}_{di})^{\top}Q(\mathbf{f}_{dj}) \tag{12}$$

$N$ is the number of pixels, $Q(\mathbf{f}_d)$=$\mathbf{W}_q\mathbf{f}_d$, and $\mathbf{W}_q$ is a 1x1 convolution filter. This leads to the short-term intra-layer attention feature $\mathbf{c}_d$ (self-attention in fig. 4) and the output $\mathbf{Y}_d$:

$$\mathbf{c}_{dj} = \sum_{i=1}^{N}\beta_{j,i}\,\mathbf{f}_{di}, \qquad \mathbf{Y}_d = \gamma_d\,\mathbf{c}_d + \mathbf{f}_d \tag{13}$$

where, following [43], we use a scale parameter $\gamma_d$ to balance the weights between $\mathbf{c}_d$ and $\mathbf{f}_d$. The initial value of $\gamma_d$ is set to zero. In addition, for attending to features $\mathbf{f}_e$ from an encoder layer, we have a long-term inter-layer attention feature $\mathbf{c}_e$ (contextual flow in fig. 4) and the output $\mathbf{Y}_e$:

$$\mathbf{c}_{ej} = \sum_{i=1}^{N}\beta_{j,i}\,\mathbf{f}_{ei}, \qquad \mathbf{Y}_e = \gamma_e(1-\mathbf{M})\,\mathbf{c}_e + \mathbf{M}\,\mathbf{f}_e \tag{14}$$

As before, a scale parameter $\gamma_e$ is used to combine the encoder feature $\mathbf{f}_e$ and the attention feature $\mathbf{c}_e$. However, unlike the decoder feature $\mathbf{f}_d$ which has information for generating a full image, the encoder feature $\mathbf{f}_e$ only represents visible parts $\mathbf{I}_m$. Hence, a binary mask $\mathbf{M}$ (holes=0) is used. Finally, both the short and long term attention features are aggregated and fed into further decoder layers.
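The attention layer can be sketched on a handful of pixels in plain Python. This is a simplified reading of the description, not the authors' layer: features are lists of per-pixel vectors, the 1x1 convolution used for the query projection is replaced by the identity, and the masking rule (encoder features kept where visible, attended features filling the holes) is one interpretation of how the binary mask is applied. All function names are illustrative.

```python
import math


def attention_weights(feats):
    """beta[j][i]: softmax over i of the dot product between pixels i and j."""
    beta = []
    for fj in feats:
        scores = [sum(a * b for a, b in zip(fi, fj)) for fi in feats]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        beta.append([e / norm for e in exps])
    return beta


def attend(beta, values):
    """c[j] = sum_i beta[j][i] * values[i], computed per channel."""
    dim = len(values[0])
    return [[sum(b * v[k] for b, v in zip(row, values)) for k in range(dim)]
            for row in beta]


def short_long_term(dec, enc, mask, gamma_d=0.0, gamma_e=0.0):
    """Return (decoder output, encoder output) for N pixels.

    dec, enc: lists of N feature vectors; mask: N values, 1 visible, 0 hole.
    The same decoder-derived attention map is reused for the encoder features.
    """
    beta = attention_weights(dec)
    c_d = attend(beta, dec)   # short-term, intra-layer
    c_e = attend(beta, enc)   # long-term, encoder-to-decoder
    y_d = [[gamma_d * c + f for c, f in zip(cj, fj)]
           for cj, fj in zip(c_d, dec)]
    y_e = [[gamma_e * (1 - m) * c + m * f for c, f in zip(cj, fj)]
           for cj, fj, m in zip(c_e, enc, mask)]
    return y_d, y_e


dec = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
enc = [[2.0, 0.0], [0.0, 0.0], [0.0, 3.0]]
mask = [1, 0, 1]  # the middle pixel is inside the hole

y_d0, y_e0 = short_long_term(dec, enc, mask)               # scales at zero
y_d1, y_e1 = short_long_term(dec, enc, mask, 1.0, 1.0)     # scales at one
```

With both scale parameters at zero the layer passes the decoder features through unchanged, which matches the stated zero initialization; as the scales grow, hole pixels start receiving encoder content borrowed from visible positions.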
4 Experimental Results
We evaluated our proposed model on four datasets including Paris [9], CelebA-HQ [25, 17], Places2 [45], and ImageNet [31] using the original training and test splits for those datasets. Since our model can generate multiple outputs, we sampled images for each masked image, and chose the top 10 results based on the discriminator scores. We trained our models for both regular and irregular holes. For brevity, we refer to our method as PICNet. We provide PyTorch implementations and an interactive demo.
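The selection step described above, drawing several completions and keeping the ten that the discriminator rates highest, amounts to a sort by score. A minimal sketch follows; the function name is illustrative, and it assumes that a higher discriminator score means a more realistic sample.

```python
def top_k_by_score(samples, scores, k=10):
    """Return the k samples with the highest scores, best first."""
    order = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    return [samples[i] for i in order[:k]]


# Toy example: four candidate completions with made-up discriminator scores.
candidates = ["a", "b", "c", "d"]
scores = [0.1, 0.9, 0.5, 0.7]
best_two = top_k_by_score(candidates, scores, k=2)
```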
4.1 Implementation Details
Our generator and discriminator networks are inspired by SA-GAN [43], but with several important modifications, including the short+long term attention layer. Furthermore, inspired by the growing-GAN [17], multi-scale output is applied to make the training faster.
The image completion network, implemented in Pytorch v0.4.0, contains 6M trainable parameters. During optimization, the weights of different losses are set to =20, =1. We used Orthogonal Initialization [33] and the Adam solver [18]. All networks were trained from scratch, with a fixed learning rate of =. Details are in the supplemental section D.
4.2 Comparison with Existing Work
Quantitative Comparisons
Quantitative evaluation is hard for the pluralistic image completion task, as our goal is to get diverse but reasonable solutions for one masked image. The original image is only one solution of many, and comparisons should not be made based on just this image.
However, just for the sake of obtaining quantitative measures, we will assume that one of our top 10 samples (ranked by the discriminator) will be close to the original ground truth, and select the single sample with the best balance of quantitative measures for comparison. The comparison is conducted on ImageNet test images, with quantitative measures of mean loss, peak signal-to-noise ratio (PSNR), total variation (TV), and Inception Score (IS) [32]. We used a mask in the center.
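Two of the listed measures are simple to state in code. The sketch below computes a mean absolute error and PSNR on flattened pixel lists; it assumes pixel values in the range [0, 1] and uses the standard PSNR definition, 10 log10(MAX^2 / MSE). The function names are illustrative and the exact evaluation protocol of the paper is not reproduced here.

```python
import math


def mean_abs_error(pred, target):
    """Mean absolute pixel difference."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)


pred = [0.5, 0.5]
target = [0.4, 0.6]
err = mean_abs_error(pred, target)
quality = psnr(pred, target)
```

Both measures compare against a single ground truth image, which is why the text treats them as a rough guide only for a task with many valid answers.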
**Qualitative Comparisons** First, we show the results in fig. 5 on the Paris dataset [9]. For fair comparison among learning-based methods, we only compared with those trained on this dataset. PatchMatch [3] worked by copying similar patches from visible regions and obtained good results on this dataset with repetitive structures. Context Encoder (CE) [29] generated reasonable structures with blurry textures. Shift-Net [38] made improvements by feature copying. Compared to these, our model not only generated more natural images, but also offered multiple solutions, e.g. different numbers of windows and varying door sizes.
Next, we evaluated our methods on CelebA-HQ face dataset, with fig. 6 showing examples with large regular holes to highlight the diversity of our output. Context Attention (CA) [42] generated reasonable completion for many cases, but for each masked input they were only able to generate a single result; furthermore, on some occasions, the single solution may be poor. Our model produced various plausible results by sampling from the latent space conditional prior.
Finally, we report the performance on the more challenging ImageNet dataset by comparing to the previous PatchMatch [3], CE [29], GL [14] and CA [42]. Different from the CE and GL models that were trained on the k subset of training images of ImageNet, our model is directly trained on original ImageNet training dataset with all images resized to . Visual results on a variety of objects from the validation set are shown in fig. 7. Our model was able to infer the content quite effectively.
4.3 Ablation Study
Our PICNet vs CVAE vs “Instance Blind” vs BicycleGAN
We investigated the influence of using our two-path training structure in comparison to other variants such as the CVAE [34] and “instance blind” structures in fig. 2. We trained the three models using common parameters. As shown in fig. 9, for the CVAE, even after sampling from the latent prior distribution, the outputs were almost identical, as the conditional prior learned is narrowly centered at the maximum latent likelihood solution. As for “instance blind”, if reconstruction loss was used only on visible pixels, the training may become unstable. If we used reconstruction loss on the full generated image, there is also little variation as the framework has likely learned to ignore the sampling and predicted a deterministic outcome purely from $\mathbf{I}_m$.
We also trained and tested BicycleGAN [47] for center masks. As is obvious in fig. 8, BicycleGAN is not directly suitable, leading to poor results or minimal variation.
**Diversity Measure** We computed diversity scores using the LPIPS metric reported in [46]. The average score is calculated between 50K pairs generated from a sampling of 1K center-masked images. Scores are reported for both the full output and the mask-region output. While BicycleGAN [46] obtained relatively higher diversity scores than the other baselines (still lower than ours), most of its generated images look unnatural (fig. 8).
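The diversity score is an average perceptual distance over pairs of completions of the same masked image. The sketch below shows only the pairing and averaging; the distance function is a simple mean absolute difference used as a stand-in, not LPIPS, which requires a pretrained network. Function names are illustrative.

```python
import itertools


def mean_abs_distance(a, b):
    """Stand-in distance between two flattened images (NOT LPIPS)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def pairwise_diversity(samples, distance):
    """Average distance over all unordered pairs of samples for one input."""
    pairs = list(itertools.combinations(samples, 2))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


# Three made-up completions of one masked image, two pixels each.
completions = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
diversity = pairwise_diversity(completions, mean_abs_distance)
```

Restricting the distance to the masked region, as in the second column of the diversity table, would mean passing only the hole pixels of each completion to the distance function.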
**Short+Long Term Attention vs Contextual Attention** We visualized our attention maps as in [43]. To compare to the contextual attention (CA) layer [42], we retrained CA on the Paris dataset via the authors’ code, and used their publicly released face model. The CA attention maps are presented in their color-directional format. As shown in fig. 10, our short+long term attention layer borrowed features from different positions with varying attention weights, rather than directly copying similar features from just one visible position. For the building scene, CA’s results were of similar high quality to ours, due to the repeated structures present. However, for a face with a large mask, CA was unable to borrow features for the hidden content (e.g. mouth, eyes) from visible regions, resulting in poor output. Our attention map is able to utilize both decoder features (which do not have masked parts) and encoder features as appropriate.
5 Conclusion
We proposed a novel dual pipeline training architecture for pluralistic image completion. Unlike existing methods, our framework can generate multiple diverse solutions with plausible content for a single masked input. The experimental results demonstrate this prior-conditional lower bound coupling is significant for conditional image generation. We also introduced an enhanced short+long term attention layer which improves realism. Experiments on a variety of datasets showed that our multiple solutions were diverse and of high-quality, especially for large holes.
Acknowledgements
This research is supported by the BeingTogether Centre, a collaboration between Nanyang Technological University (NTU) Singapore and University of North Carolina (UNC) at Chapel Hill. The BeingTogether Centre is supported by the National Research Foundation, Prime Minister’s Office, Singapore under its International Research Centres in Singapore Funding Initiative. This research was also conducted in collaboration with Singapore Telecommunications Limited and partially supported by the Singapore Government through the Industry Alignment Fund - Industry Collaboration Projects Grant.
Appendix A Additional Examples
We first show our results on center hole completion, in relation to those from other methods trained on corresponding datasets. As for random irregular and regular holes, we simply present our results so that readers may appreciate the multiple diverse results we can get with differently sized and shaped holes. Finally, we show the interesting application on face editing.
A.1 Comparison with Existing Work on Center Hole Completion
A.2 Additional Results on Random and Irregular Hole Completion
A.3 Additional Results on Free-Form Mask Using Our Interactive Demo
A.4 Video for Additional Results
Besides this document, we also included two video clips of additional results as part of the supplemental material. The first video shows free-form mask results on various datasets. The second video consists of four parts, showing multiple examples of center hole completion, random hole completion, comparison results with different training strategies, and face editing of my self-portraits.
Appendix B Mathematical Derivation and Analysis
B.1 Difficulties with Using the Classical CVAE for Image Completion
Here we elaborate on the difficulties encountered when using the classical CVAE formulation for pluralistic image completion, expanding on the shorter description in section 3.1.
B.1.1 Background: Derivation of the Conditional Variational Auto-Encoder (CVAE)
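The broad CVAE framework of Sohn et al. [34] is a straightforward conditioning of the classical VAE. Using the notation in our main paper, a latent variable $\mathbf{z}_c$ is assumed to stochastically generate the hidden partial image $\mathbf{I}_c$. When conditioned on the visible partial image $\mathbf{I}_m$, we get the conditional probability:

$$p(\mathbf{I}_c|\mathbf{I}_m) = \int p_{\phi}(\mathbf{z}_c|\mathbf{I}_m)\,p_{\theta}(\mathbf{I}_c|\mathbf{z}_c,\mathbf{I}_m)\,d\mathbf{z}_c \tag{B.1}$$

The variance of the Monte Carlo estimate can be reduced by importance sampling to get

$$p(\mathbf{I}_c|\mathbf{I}_m) = \int q_{\psi}(\mathbf{z}_c|\mathbf{I}_c,\mathbf{I}_m)\,\frac{p_{\phi}(\mathbf{z}_c|\mathbf{I}_m)}{q_{\psi}(\mathbf{z}_c|\mathbf{I}_c,\mathbf{I}_m)}\,p_{\theta}(\mathbf{I}_c|\mathbf{z}_c,\mathbf{I}_m)\,d\mathbf{z}_c = \mathbb{E}_{q_{\psi}(\mathbf{z}_c|\mathbf{I}_c,\mathbf{I}_m)}\left[\frac{p_{\phi}(\mathbf{z}_c|\mathbf{I}_m)}{q_{\psi}(\mathbf{z}_c|\mathbf{I}_c,\mathbf{I}_m)}\,p_{\theta}(\mathbf{I}_c|\mathbf{z}_c,\mathbf{I}_m)\right] \tag{B.2}$$

Taking logs and applying Jensen’s inequality leads to

$$\log p(\mathbf{I}_c|\mathbf{I}_m) \geq \mathbb{E}_{q_{\psi}(\mathbf{z}_c|\mathbf{I}_c,\mathbf{I}_m)}\big[\log p_{\theta}(\mathbf{I}_c|\mathbf{z}_c,\mathbf{I}_m)\big] - \mathrm{KL}\big(q_{\psi}(\mathbf{z}_c|\mathbf{I}_c,\mathbf{I}_m)\,\|\,p_{\phi}(\mathbf{z}_c|\mathbf{I}_m)\big) \tag{B.3}$$

The variational lower bound totaled over all training data is jointly maximized w.r.t. the network parameters $\theta$, $\phi$ and $\psi$ in attempting to maximize the total log likelihood of the observed training instances.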
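The Jensen step can be checked numerically on a toy model in which every quantity has a closed form. The sketch below drops the conditioning and uses a one-dimensional linear-Gaussian model chosen for illustration (prior N(0, 1), likelihood N(z, 1), so the marginal is N(0, 2) and the exact posterior is N(x/2, 1/2)); it is not the paper's model. The lower bound equals the log-likelihood when the importance function is the exact posterior and is strictly smaller otherwise.

```python
import math


def log_marginal(x):
    """log p(x) for p(z) = N(0, 1) and p(x|z) = N(z, 1), i.e. p(x) = N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x * x / 4.0


def elbo(x, m, s2):
    """Variational lower bound with importance function q(z) = N(m, s2)."""
    # E_q[log p(x|z)] for a unit-variance Gaussian likelihood.
    expected_loglik = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    # KL(q || p) against the standard normal prior.
    kl = 0.5 * (s2 + m * m - 1.0 - math.log(s2))
    return expected_loglik - kl


x = 1.0
tight = elbo(x, m=x / 2.0, s2=0.5)   # q set to the exact posterior
loose = elbo(x, m=0.0, s2=1.0)       # q set to the prior
exact = log_marginal(x)
```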
B.1.2 Single Instance Per Conditioning Label
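As is typically the case for image completion, there is only one training instance of $\mathbf{I}_c$ for each unique $\mathbf{I}_m$. This means that for the function $q_{\psi}(\mathbf{z}_c|\mathbf{I}_c,\mathbf{I}_m)$, $\mathbf{I}_c$ can simply be learnt into the network as a hardcoded dependency of the input $\mathbf{I}_m$, so $q_{\psi}(\mathbf{z}_c|\mathbf{I}_c,\mathbf{I}_m)\cong\hat{q}_{\psi}(\mathbf{z}_c|\mathbf{I}_m)$. Assuming that the network for $p_{\phi}(\mathbf{z}_c|\mathbf{I}_m)$ has similar or higher modeling power and there are no other explicit constraints imposed on it, then in training $p_{\phi}(\mathbf{z}_c|\mathbf{I}_m)\rightarrow\hat{q}_{\psi}(\mathbf{z}_c|\mathbf{I}_m)$, and the KL divergence in (B.3) goes to zero.

In this situation of zero KL divergence, we can rewrite the variational lower bound and replace $\hat{q}_{\psi}$ with $p_{\phi}$ without loss of generality, as

$$\mathcal{V} = \mathbb{E}_{p_{\phi}(\mathbf{z}_c|\mathbf{I}_m)}\big[\log p_{\theta}(\mathbf{I}_c|\mathbf{z}_c,\mathbf{I}_m)\big] \tag{B.4}$$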
B.1.3 Unconstrained Learning of the Conditional Prior
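We can analyze how the right-hand side of (B.4) can be maximized, by using Jensen’s inequality again (reversing its earlier use):
$$\mathbb{E}_{z_c \sim p_\phi(z_c|I_m)}\left[\log p_\theta(I_c|z_c,I_m)\right] \le \log \int p_\phi(z_c|I_m)\, p_\theta(I_c|z_c,I_m)\, dz_c \tag{B.5}$$
By further applying Hölder’s inequality (i.e. $\|fg\|_1 \le \|f\|_r \|g\|_s$ for $1/r + 1/s = 1$) with $r = 1$ and $s = \infty$, we get
$$\log \int p_\phi(z_c|I_m)\, p_\theta(I_c|z_c,I_m)\, dz_c \le \log\left[\int \left|p_\phi(z_c|I_m)\right| dz_c \cdot \max_{z_c} p_\theta(I_c|z_c,I_m)\right] = \log \max_{z_c} p_\theta(I_c|z_c,I_m) \tag{B.6}$$
Assuming that there is a unique global maximum for $p_\theta(I_c|z_c,I_m)$, the bound achieves equality when the conditional prior becomes a Dirac delta function centered at the maximum latent likelihood point
$$p_\phi(z_c|I_m) \to \delta(z_c - z_c^*), \quad \text{where } z_c^* = \arg\max_{z_c} p_\theta(I_c|z_c,I_m) \tag{B.7}$$
Intuitively, subject to the vagaries of stochastic gradient descent, the network for $p_\phi(z_c|I_m)$ without further constraints will learn a narrow delta-like function that sifts out the maximum latent likelihood value of $p_\theta(I_c|z_c,I_m)$.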
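The chain of inequalities in this subsection can be checked numerically on a discretized latent space. The Python sketch below is a minimal illustration under stated assumptions: the five likelihood values and the three candidate priors are hypothetical numbers chosen for the example, and the integral over the latent variable is replaced by a finite sum. It shows that the expected log likelihood is bounded by the log of the maximum likelihood, and that a one-hot (delta-like) prior attains the bound.

```python
import math

def expected_log_likelihood(prior, likelihood):
    """Sum_i prior_i * log(likelihood_i): the quantity being maximized."""
    return sum(p * math.log(l) for p, l in zip(prior, likelihood) if p > 0.0)

def log_expected_likelihood(prior, likelihood):
    """log(Sum_i prior_i * likelihood_i): the bound from Jensen's inequality."""
    return math.log(sum(p * l for p, l in zip(prior, likelihood)))

# Hypothetical likelihood values on a grid of five latent codes (illustrative only).
likelihood = [0.05, 0.20, 0.90, 0.30, 0.10]

# Three candidate priors over the same grid, from broad to delta-like.
priors = {
    "uniform": [0.2, 0.2, 0.2, 0.2, 0.2],
    "peaked": [0.0, 0.1, 0.8, 0.1, 0.0],
    "delta": [0.0, 0.0, 1.0, 0.0, 0.0],
}

upper = math.log(max(likelihood))  # log of the maximum likelihood value
results = {}
for name, prior in priors.items():
    lhs = expected_log_likelihood(prior, likelihood)
    mid = log_expected_likelihood(prior, likelihood)
    results[name] = (lhs, mid)
    print(f"{name:8s} E[log L] = {lhs:.4f} <= log E[L] = {mid:.4f} <= log max L = {upper:.4f}")
```

The narrower the prior, the larger the objective, which is the degeneracy described above: nothing in this objective alone rewards keeping the prior broad.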
As mentioned in section 3.1, although this narrow conditional prior may be helpful in estimating a single solution for $I_c$ given $I_m$ during testing, this is poor for sampling a diversity of solutions. In our framework, the (unconditional) latent priors are imposed for the partial images themselves, which prevents this delta function degeneracy.
B.1.4 CVAE with Fixed Prior
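An alternative CVAE variant [37] assumes that the conditional prior is independent of $I_m$ and fixed, so $p_\phi(z_c|I_m) \cong p(z_c)$, where $p(z_c)$ is a fixed distribution (e.g. standard normal). This means
$$\log p(I_c|I_m) \ge \mathbb{E}_{z_c \sim q_\psi(z_c|I_c,I_m)}\left[\log p_\theta(I_c|z_c,I_m)\right] - \mathrm{KL}\left(q_\psi(z_c|I_c,I_m)\,\|\,p(z_c)\right) \tag{B.8}$$
Now we can consider the case for a fixed $I_m = I_m^*$, and rewrite (B.8) as
$$\log p^*(I_c) \ge \mathbb{E}_{z_c \sim q^*_\psi(z_c|I_c)}\left[\log p^*_\theta(I_c|z_c)\right] - \mathrm{KL}\left(q^*_\psi(z_c|I_c)\,\|\,p(z_c)\right) \tag{B.9}$$
where the superscript $*$ denotes the corresponding function with $I_m$ held fixed at $I_m^*$.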
Doing so makes it obvious that we can then derive the standard (unconditional) VAE formulation from here. Thus an appropriate interpretation of this CVAE variant is that it uses $I_m$ as a “switch” parameter to choose between different VAE models that are trained for the specific conditions.
Once again, this is fine if there are multiple training instances per conditional label. However, in the image completion problem, there is only one $I_c$ per unique $I_m$, so the condition-specific VAE model will simply ignore the sampling “noise” and learn to predict the single instance of $I_c$ from $I_m$ directly, i.e. $p_\theta(I_c|z_c,I_m) \cong p_\theta(I_c|I_m)$, which incidentally achieves equality for the variational lower bound. This results in negligible variation of output despite now sampling from $p(z_c)$.
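The tendency of the decoder to ignore the sampled latent code when each condition has a single target can be illustrated with a deliberately simple linear stand-in. In the Python sketch below, the scalar "condition", the linear decoder, the least-squares fit and the function name `fit_linear_decoder` are all assumptions made for illustration; this is not our network or training procedure. Each condition value has exactly one target, the latent code is drawn from a fixed standard normal, and the fitted weight on the latent code comes out as zero.

```python
import random

def fit_linear_decoder(n=1000, seed=0):
    """Least-squares fit of y = a * x + b * z, where z is noise independent of the target."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]  # scalar stand-in for the visible region
    zs = [rng.gauss(0.0, 1.0) for _ in range(n)]     # latent code sampled from a fixed prior
    ys = [2.0 * x for x in xs]                       # exactly one target per condition

    # Sufficient statistics for the 2x2 normal equations.
    sxx = sum(x * x for x in xs)
    szz = sum(z * z for z in zs)
    sxz = sum(x * z for x, z in zip(xs, zs))
    sxy = sum(x * y for x, y in zip(xs, ys))
    szy = sum(z * y for z, y in zip(zs, ys))

    # Solve [[sxx, sxz], [sxz, szz]] @ [a, b] = [sxy, szy] by Cramer's rule.
    det = sxx * szz - sxz * sxz
    a = (sxy * szz - szy * sxz) / det
    b = (szy * sxx - sxy * sxz) / det
    return a, b

a, b = fit_linear_decoder()
print(f"weight on condition a = {a:.6f}, weight on latent code b = {b:.6f}")
```

Because the target is fully determined by the condition, any non-zero weight on the latent code can only add error, so sampling different latent codes at test time produces the same output.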
Our framework resolves this in part by defining all (unconditional) partial images of $I_c$ as sharing a common latent space with adaptive priors, with the likelihood parameters learned as an unconditional VAE, and further coupling on the conditional portion (i.e. the generative path) to get a more distinct but regularized estimate for $p_\phi(z_c|I_m)$.
B.2 Joint Maximization of Unconditional and Conditional Variational Lower Bounds
The overall training loss function (5) used in our framework has a direct link to jointly maximizing the unconditional and conditional variational lower bounds, respectively expressed by (3.1) and (3.1). Using simplified notation, we rewrite these bounds respectively as:
[TABLE]
To clarify, is the lower bound related to the unconditional log likelihood of observing , while relates to the log likelihood of observing conditioned on . The expression of reflects a blend of conditional likelihood formulations with and without the use of importance sampling, which are matched to different likelihood models, as explained in section 3.1. Note that the coefficient from (3.1) is left out here for simplicity, but there is no loss of generality since we can ignore a constant factor of the true lower bound if we are simply maximizing it.
We can then define a combined objective function as our maximization goal
[TABLE]
with .
To understand the relation between in (B.11) and in (5), we consider the equivalence of:
[TABLE]
Comparing terms
[TABLE]
For the reconstructive path that involves sampling from the (posterior) importance function of (3.1), we can substitute and get the reconstructive log likelihood formulation as
[TABLE]
Here, is available, with reconstructing both and as in (8), while involves GAN-based pairwise feature matching (10).
For the generative path that involves sampling from the conditional prior , we have the generative log likelihood formulation as
[TABLE]
As explained in sections 3.1 and 3.2, the generative path does not have direct access to , and this is reflected in the likelihood in which the instances of are ignored. Thus is only for reconstructing in a deterministic auto-encoder fashion as per (9), while in (11) only tries to enforce that the generated distribution be consistent with the training set distribution (hence without per-instance knowledge), as implemented in the form of a GAN.
Appendix C Architectural Details
Our pluralistic image completion network (PICNet) architecture is inspired by SA-GAN [43] and BigGAN, but features several important modifications that enable us to train for this image-conditional generation task. We first replace the batch normalization with instance normalization in the generation network (ResBlock up in Fig. C.7), and remove the batch normalization in our other networks (i.e. the representation, inference and discriminator networks comprising ResBlock start and ResBlock in Fig. C.7), because different holes will affect the means and variances in each batch. ResBlock down is similar to ResBlock, except that we add an average pooling layer after Conv and Conv.
The Infer1 network consists of only one Residual Block, for self-inferring the latent distribution of the ground truth (treated as known in the reconstructive path), while the Infer2 network consists of seven Residual Blocks, which are applied to predict the latent distribution of $I_c$ (treated as unknown in the generative path) based on the visible pixels $I_m$.
Appendix D Experimental Details
Our network is implemented in PyTorch v0.4.0, and employs the architectures of Appendix C. To reduce memory cost, we restricted the feature channel width to and selected . We experimented with different channels with largest being , but found that the improvement was not obvious. In addition, we applied the self-attention layer of the discriminator and the short+long term attention layer of the generator on a feature size. Spectral Normalization is used in all networks. All networks are initialized with Orthogonal Initialization and trained from scratch with a fixed learning rate of . We used the Adam optimizer with and .
The final weights we used were =20, =1. The KL loss and appearance matching loss weights come from the variational lower bound. Since the appearance matching loss is used in four output scales, the final weight for the KL loss is , where is the number of output scales. We also tried different values of and , and found that the bigger the KL loss weight, the greater the diversity of the generated , but it was also harder to retain the appearance consistency of the generated to the visible region . The values of and were obtained from -GAN. We experimented with the number of steps per step (varying it from 1 to 5), and found that one step per step gave the best results. When is smaller than 1, we can use two or four steps per step, but the full generated does not reconstruct the original conditional visible regions well. When is larger than 100, we needed two or four steps per step, if not the discriminator loss will become zero and the generated will be blurry.
We trained each model on a single GPU, with a batch size of 20 on a GTX 1080 Ti (11GB) and 32 on an NVIDIA V100 (16GB). Training models for centered holes on Paris and CelebA-HQ takes roughly 3 days, while for ImageNet and Places2 it takes roughly 2 weeks. Training models for random irregular and un-centered holes takes about twice as long as training models for centered holes. Moreover, since the prior distribution for random holes changes with the number of pixels in each hole, the training loss may sometimes change abruptly due to the KL loss component.
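The sensitivity of the KL term to a mask-dependent prior can be illustrated with the closed-form KL divergence between two univariate Gaussians. In the Python sketch below, the posterior parameters and the list of prior standard deviations are hypothetical values chosen only for illustration, and the actual dependence of the prior on the hole size is not reproduced here. The point is that the same inferred posterior can incur very different KL penalties when the prior changes from one training sample to the next.

```python
import math

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for scalar Gaussians."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

# Hypothetical inferred posterior (mean, standard deviation) for one latent dimension.
posterior = (0.3, 0.5)

# Hypothetical zero-mean priors whose width differs from sample to sample.
kl_by_prior_std = {s: gaussian_kl(*posterior, 0.0, s) for s in (0.25, 0.5, 1.0, 2.0)}

for prior_std, kl in kl_by_prior_std.items():
    print(f"prior std = {prior_std:<4} KL = {kl:.4f}")
```

With these illustrative numbers, narrowing the prior standard deviation from 0.5 to 0.25 raises the per-dimension KL penalty several times over, which is the kind of jump that can show up in the training loss when consecutive samples have very different hole sizes.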
References
- [1] Coloma Ballester, Marcelo Bertalmio, Vicent Caselles, Guillermo Sapiro, and Joan Verdera. Filling-in by joint interpolation of vector fields and gray levels. IEEE Transactions on Image Processing, 10(8):1200–1211, 2001.
- [2] Jianmin Bao, Dong Chen, Fang Wen, Houqiang Li, and Gang Hua. CVAE-GAN: Fine-grained image generation through asymmetric training. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2764–2773. IEEE, 2017.
- [3] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (ToG), 28:24, 2009.
- [4] Marcelo Bertalmio, Guillermo Sapiro, Vincent Caselles, and Coloma Ballester. Image inpainting. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pages 417–424. ACM Press/Addison-Wesley Publishing Co., 2000.
- [5] Marcelo Bertalmio, Luminita Vese, Guillermo Sapiro, and Stanley Osher. Simultaneous structure and texture image inpainting. IEEE Transactions on Image Processing, 12(8):882–889, 2003.
- [6] Zeyuan Chen, Shaoliang Nie, Tianfu Wu, and Christopher G Healey. High resolution face completion with multiple controllable attributes via fully end-to-end progressive generative adversarial networks. arXiv preprint arXiv:1801.07632, 2018.
- [7] Antonio Criminisi, Patrick Pérez, and Kentaro Toyama. Object removal by exemplar-based inpainting. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 2, pages II–II. IEEE, 2003.
- [8] Antonio Criminisi, Patrick Pérez, and Kentaro Toyama. Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on Image Processing, 13(9):1200–1212, 2004.
