Generative Guiding Block: Synthesizing Realistic Looking Variants Capable of Even Large Change Demands
Minho Park, Hak Gu Kim, Yong Man Ro

TL;DR
This paper introduces a novel generative guiding block for realistic image synthesis that effectively handles large variations and deformations, enhancing the quality and diversity of generated images.
Contribution
The paper proposes a new generative guiding block with dual discriminators to improve large-variation image synthesis, a novel approach compared to existing methods.
Findings
Enhanced image realism and variation handling demonstrated in experiments
Outperforms state-of-the-art methods in qualitative and quantitative evaluations
Effective preservation of appearance despite large transformations
| Model | SSIM | IS |
|---|---|---|
| Ours w/o GGBs | 0.705 | 2.81 |
| Ours w/o RAPD | 0.709 | 2.72 |
| Ours w/o NVTD | 0.714 | 2.73 |
| Ours with 1 GGB | 0.780 | 3.14 |
| Ours with 2 GGBs | 0.793 | 3.15 |
| Ours | 0.799 | 3.26 |
Abstract
Realistic image synthesis aims to generate an image that is perceptually indistinguishable from an actual image. Generating realistic-looking images with large variations (e.g., large spatial deformations and large pose changes), however, is very challenging. Both handling large variations and preserving appearance need to be taken into account in realistic-looking image generation. In this paper, we propose a novel realistic-looking image synthesis method, especially for large change demands. To that end, we devise generative guiding blocks. The proposed generative guiding block includes a realistic appearance preserving discriminator and a naturalistic variation transforming discriminator. By incorporating the proposed generative guiding blocks into the generative model, the latent features at the layers of the generative model are enhanced to synthesize images that are both realistic-looking and consistent with the target variation. With qualitative and quantitative evaluation in experiments, we demonstrate the effectiveness of the proposed generative guiding blocks compared with state-of-the-art methods.
**Index Terms:** Deep learning, adversarial learning, variation image synthesis, feature enhancement
1 INTRODUCTION
Generating realistic-looking images has drawn great attention and is considered an important task in generative models for image synthesis. Recently, deep learning-based generative models have achieved remarkable success in various synthesis tasks such as face, human, and scene generation. In data acquisition, it is time-consuming and costly to collect or capture images with desired variations (e.g., pose, illumination, facial expression, and viewpoint). Generative models that can automatically synthesize images with the desired variations are therefore needed in practice.
To generate realistic-looking images of objects, it is necessary to understand both their appearance and their variants. An object has inherent appearance properties characterized by color and texture, such as hair color and fashion style. On the other hand, there are variants, including the shape and geometrical layout of the object. One of the most challenging points in image generation is to preserve the appearance properties of the input image (e.g., color, texture, the identity of a person) while performing spatial deformation according to the variants (e.g., pose variation and illumination variation).
For this task, so far, various methods have been proposed based on Variational Auto-Encoders (VAEs) [1], Generative Adversarial Networks (GANs) [2] and Autoregressive models (ARMs) (e.g., PixelRNN [3]) [4, 5, 6, 7, 8, 9, 10, 11, 12]. Recently, a wide range of methods including conditional GANs [13] or conditional VAEs [14] have been proposed for synthesizing the images whose appearances depend on a given conditioning variable (e.g., label). However, most of them could not deal with the large variations (e.g., large spatial deformation [15]) between the input and the target image while preserving the appearance of a given input. Due to the high dimensionality of images and the complex configuration of image contents, it is difficult for a complete end-to-end framework to generate both the correct target variation and the detailed appearance simultaneously [16, 17, 18, 19].
In this paper, we focus on realistic appearance and naturalistic variation in target image generation. The generative features are enhanced with appearance preservation and variant transformation. Our objective is to propose a new generation method that addresses two problems: realistic appearance and naturalistic large variation. To cope with these problems, we propose novel generative guiding blocks (GGBs). Each generative guiding block consists of a realistic appearance preserving discriminator (RAPD) and a naturalistic variation transforming discriminator (NVTD). In the proposed RAPD, to preserve the object appearance of the input image (e.g., the identity of a person), the overall image distribution is considered by determining whether the appearance is preserved in the target image or not. Simultaneously, in the proposed NVTD, to generate the target image with large variation, the change information of the deformation is considered by focusing on the variation between the input and the generated target image. We hierarchically integrate the proposed GGBs with the decoding module of the generator to enhance the generative features at multiple resolution levels. The proposed generative model with GGBs can robustly synthesize realistic-looking images even with large variations while maintaining naturalistic variants. Experimental results show the effectiveness of the proposed GGBs.
The rest of this paper is organized as follows. In Section 2, we describe the proposed generative model with GGBs. In Section 3, the experimental results are presented. Finally, conclusions are drawn in Section 4.
2 PROPOSED METHOD
Fig. 1 shows the proposed generative model with generative guiding blocks (GGBs). The generator synthesizes a fake image that has the appearance of the input image and the target variants. The discriminator determines whether the fake image is real or not. As shown in Fig. 1, the GGBs are attached to the multi-level generative features of multiple layers in the decoder of the generator. The GGBs determine whether the generated multi-resolution images have realistic appearance (handled by the RAPD in each GGB) and naturalistic variation (handled by the NVTD in each GGB). Variant transformation is performed hierarchically in a multi-resolution manner so that the proposed generator can handle large variation demands. In the following subsections, we describe the generator, the discriminator and the GGBs in detail.
2.1 Generative model with discriminator
Let $\mathbf{x}$ denote the input image and $\mathbf{y}$ denote the ground-truth target image. $c$ denotes the target variation and $\hat{\mathbf{x}}$ (i.e., $G(\mathbf{x}, c)$) denotes the generated image, where $G$ denotes the generator and $D$ denotes the discriminator. The feature produced at the $n$-th decoder layer is referred to as the $n$-th generative feature. $c$ is encoded into a label map; by encoding $c$, abundant condition information of the desired variation is provided to $G$. In this paper, a U-Net-like structure is employed as $G$ [20, 21]. The encoder and decoder of $G$ consist of 7 convolution layers and 7 deconvolution layers, respectively (i.e., $N$ = 7), with a stride of 2. $D$ consists of 5 convolution layers with a stride of 2.
With adversarial learning [2], $D$ determines whether $\hat{\mathbf{x}}$ is realistic-looking or not, comparing it with $\mathbf{y}$. The objective function of $D$ can be written as

$$\mathcal{L}_{D} = -\mathbb{E}_{\mathbf{y}\sim p_{\mathbf{y}}}[\log(D(\mathbf{y}))] - \mathbb{E}_{\mathbf{x}\sim p_{\mathbf{x}}}[\log(1-D(G(\mathbf{x},c)))]. \tag{1}$$

On the other hand, $G$ tries to fool $D$ by generating a realistic image. To that end, the loss of the generator is composed of two terms, which are the realism loss, $\ell_{real}$, and the reconstruction loss, $\ell_{rec}$. The realism loss can be written as

$$\ell_{real} = -\mathbb{E}_{\mathbf{x}\sim p_{\mathbf{x}}}[\log(D(G(\mathbf{x},c)))]. \tag{2}$$

The reconstruction loss between the ground-truth target image and the generated image at the $n$-th level, $\ell^{n}_{rec}$, in the decoder can be written as

$$\ell^{n}_{rec} = \mathbb{E}_{\mathbf{x}\sim p_{\mathbf{x}}}[\|\mathbf{y}^{n}-\hat{\mathbf{x}}^{n}\|_{1}], \tag{3}$$

where $\hat{\mathbf{x}}^{n}$ indicates a generated image from the $n$-th generative feature and $\mathbf{y}^{n}$ indicates an image downsized from $\mathbf{y}$ to the same resolution as $\hat{\mathbf{x}}^{n}$ (as shown in Fig. 2).

Finally, the total loss function of the proposed generator, $\mathcal{L}_{G}$, can be defined as a combination of the realism loss and the reconstruction loss:

$$\mathcal{L}_{G} = \lambda_{real}\ell_{real} + \ell^{N}_{rec}, \tag{4}$$

where $\lambda_{real}$ is a weight parameter to control the balance between $\ell_{real}$ and $\ell^{N}_{rec}$.

Fig. 2: The architecture of the proposed $n$-th GGB.
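The losses in Eq. 1 to Eq. 4 can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names are hypothetical, discriminator outputs are assumed to be probabilities in (0, 1), images are assumed to be flattened lists of pixel values, and expectations are approximated by batch means.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Eq. 1: -E[log D(y)] - E[log(1 - D(G(x, c)))], using batch means."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def realism_loss(d_fake):
    """Eq. 2: -E[log D(G(x, c))], using a batch mean."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

def reconstruction_loss(y_n, x_hat_n):
    """Eq. 3: batch mean of the L1 norm between target and generated images."""
    per_sample = [
        sum(abs(a - b) for a, b in zip(target, generated))
        for target, generated in zip(y_n, x_hat_n)
    ]
    return sum(per_sample) / len(per_sample)

def generator_loss(d_fake, y, x_hat, lambda_real):
    """Eq. 4: lambda_real * l_real + l_rec at the final level N."""
    return lambda_real * realism_loss(d_fake) + reconstruction_loss(y, x_hat)

# Toy batch: one flattened 2-pixel "image"; lambda_real value is illustrative.
loss_g = generator_loss([0.5], [[1.0, 2.0]], [[0.0, 4.0]], lambda_real=0.02)
```

With a discriminator that outputs 0.5 everywhere, the realism loss equals log 2, so the reconstruction term dominates the toy generator loss above.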
2.2 Generative Guiding Block for realistic appearance and naturalistic variation
Fig. 2 shows the architecture of the proposed $n$-th GGB, which consists of a realistic appearance preserving discriminator (RAPD) and a naturalistic variation transforming discriminator (NVTD). The GGBs are attached to the multi-level generative features of multiple layers in the decoder, as shown in Fig. 1. Let $\mathbf{x}^{n}$ denote the input image $\mathbf{x}$ downsized to the same resolution as $\hat{\mathbf{x}}^{n}$. Each GGB also contains a feature encoder. In this paper, the RAPD and the NVTD each consist of 3 convolution layers. The feature encoder consists of 2 convolution layers with a 4×4 kernel and a stride of 2.
First, to deal with the feature information of $\mathbf{x}^{n}$, $\mathbf{y}^{n}$ and $\hat{\mathbf{x}}^{n}$, the images are encoded into latent features by the feature encoder. After that, the RAPD distinguishes whether the encoded features of $\mathbf{y}^{n}$ and $\hat{\mathbf{x}}^{n}$ are realistic or not. As shown in Fig. 2, the NVTD distinguishes whether the residual information of the encoded features (i.e., the residual of each of these two encoded features with respect to the encoded feature of $\mathbf{x}^{n}$) is realistic or not. The input of the NVTD is residual information so that the NVTD focuses only on the target variation. $G$ tries to fool the RAPD, so that the encoded feature of $\hat{\mathbf{x}}^{n}$ mimics the data distribution of the encoded feature of $\mathbf{y}^{n}$. Through this process, the $n$-th generative feature is enhanced for generating an image with realistic appearance. Likewise, $G$ tries to fool the NVTD, so that the residual of the generated image follows the residual of the ground-truth target. The $n$-th generative feature is thereby also enhanced for generating an image with naturalistic variation.
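How the two discriminators in a GGB receive different inputs can be sketched as follows. This is an illustrative assumption rather than the paper's code: the residual is taken to be the element-wise difference between a target-side encoded feature and the encoded input feature, encoded features are represented as flat lists, and all names (`residual`, `ggb_inputs`, the dictionary keys) are hypothetical.

```python
def residual(target_feature, input_feature):
    """Element-wise difference; assumed form of the 'residual information'."""
    return [t - s for t, s in zip(target_feature, input_feature)]

def ggb_inputs(e_x, e_y, e_x_hat):
    """Arrange encoded features into real/fake inputs for RAPD and NVTD.

    e_x:     encoded feature of the downsized input image
    e_y:     encoded feature of the downsized ground-truth target
    e_x_hat: encoded feature of the generated image at this level
    """
    return {
        # RAPD sees the encoded features themselves (appearance).
        "rapd_real": e_y,
        "rapd_fake": e_x_hat,
        # NVTD sees only the change relative to the input (variation).
        "nvtd_real": residual(e_y, e_x),
        "nvtd_fake": residual(e_x_hat, e_x),
    }

inputs = ggb_inputs(e_x=[1.0, 2.0], e_y=[3.0, 5.0], e_x_hat=[2.5, 4.0])
```

Subtracting the input feature removes what the input and target share, which is one way to read the statement that the NVTD should focus only on the target variation.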
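The discriminators in the GGB, the RAPD and the NVTD, are trained by adversarial learning with $G$. Therefore, we adopt the generative adversarial loss. First, the objective function of the RAPD is defined as
[Eq. 5: objective function of the RAPD in the $n$-th GGB]
where the superscript $n$ indicates the RAPD in the $n$-th GGB. Similarly, the objective function of the NVTD is defined as
[Eq. 6: objective function of the NVTD in the $n$-th GGB]
where the superscript $n$ indicates the NVTD in the $n$-th GGB.
The RAPD and the NVTD are trained to minimize the objectives in Eq. 5 and Eq. 6, respectively. In contrast, $G$ with GGBs is trained to minimize two corresponding adversarial losses so that it learns to fool the RAPD and the NVTD. These objective functions can be written as
[Eq. 7: loss of $G$ for fooling the RAPD]
[Eq. 8: loss of $G$ for fooling the NVTD]
In particular, to preserve the appearance information, we adopt the L1 norm as our reconstruction loss (Eq. 3). Finally, the objective function of $G$ with our GGBs is defined as
[Eq. 9: objective function of $G$ with GGBs]
where a weight parameter is used for the weighted sum of the multi-level GGB losses.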
2.3 Training strategy
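In every iteration, $\mathbf{x}$ and $c$ are given to $G$. Then, $G$ generates $\hat{\mathbf{x}}$. In $D$, $\mathcal{L}_{D}$ is calculated with $\hat{\mathbf{x}}$ and $\mathbf{y}$ (see Eq. 1). In the $n$-th GGB, the RAPD and NVTD objectives are calculated with $\mathbf{x}^{n}$, $\mathbf{y}^{n}$ and $\hat{\mathbf{x}}^{n}$ (see Eq. 5 and 6). After that, the weights of $D$ are updated to minimize $\mathcal{L}_{D}$. Also, the weights of the $n$-th GGB are updated to minimize its RAPD and NVTD objectives ($n$ = 1, 2, …, $N$-1). The weights of $G$, except for a subset, are first updated to minimize the objective of $G$ with GGBs (see Eq. 9). Finally, the weights of $G$ are updated to minimize $\mathcal{L}_{G}$ (see Eq. 4). This process is repeated until the weights are optimized.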
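The order of operations in one training iteration can be summarized with a small schedule sketch. It only records the sequence of steps described above and performs no actual optimization; the function name and the step strings are hypothetical, and the GGB levels (4, 5, 6) used in the example are taken from the implementation details in Section 3.2.

```python
def training_schedule(ggb_levels):
    """Return the ordered list of steps for one training iteration."""
    steps = ["forward: x_hat = G(x, c)", "compute L_D (Eq. 1)"]
    # Each GGB computes its RAPD and NVTD objectives at its own resolution.
    for n in ggb_levels:
        steps.append(f"compute RAPD/NVTD objectives of GGB {n} (Eq. 5, 6)")
    steps.append("update D to minimize L_D")
    for n in ggb_levels:
        steps.append(f"update RAPD/NVTD of GGB {n}")
    # The generator is updated twice: first on the GGB objective, then on L_G.
    steps.append("update G on the GGB objective (Eq. 9)")
    steps.append("update G on L_G (Eq. 4)")
    return steps

schedule = training_schedule(ggb_levels=(4, 5, 6))
for step in schedule:
    print(step)
```

Listing the steps this way makes explicit that all discriminators are updated before either generator update in the same iteration.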
3 EXPERIMENTS AND RESULTS
3.1 Datasets
To verify the effectiveness of the proposed generative model with GGBs, we used a public dataset, DeepFashion [22]. This dataset consists of 52,712 in-shop clothes images with 256×256 resolution. Similar to [16], we have 146,680 pairs for the training set. Each pair is composed of two images of the same identity but different poses. For the test set, we randomly selected 12,800 pairs from the test set. To use the human pose landmarks of the DeepFashion data as the target variation, we applied a state-of-the-art pose estimator [23], as in [16].
3.2 Implementation details
We used the Adam optimizer [24] with $\beta_1$ = 0.5, $\beta_2$ = 0.999, a batch size of 8, and a learning rate of 0.0002 to train the proposed models. In our experiment, we attached three GGBs to the generative features with 32×32, 64×64 and 128×128 resolutions (i.e., the 4-th, 5-th and 6-th levels). We empirically set the three loss weights to 0.02, 0.01 and 0.01.
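The correspondence between decoder levels and feature resolutions can be checked with a short calculation. This sketch assumes that every stride-2 layer exactly halves (encoder) or doubles (decoder) the spatial size, that the input is 256×256, and that level $n$ counts decoder layers from the bottleneck; the function name is hypothetical.

```python
def decoder_resolutions(image_size=256, num_layers=7):
    """Map decoder level n (1..num_layers) to its output side length."""
    # Seven stride-2 encoder layers reduce 256 to 256 / 2**7 = 2.
    bottleneck = image_size // (2 ** num_layers)
    # Each stride-2 decoder layer doubles the side length.
    return {n: bottleneck * 2 ** n for n in range(1, num_layers + 1)}

resolutions = decoder_resolutions()
print({n: resolutions[n] for n in (4, 5, 6)})
```

Under these assumptions, levels 4, 5 and 6 have side lengths 32, 64 and 128, and level 7 produces the full 256×256 output.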
3.3 Performance evaluation
Fig. 3 compares images generated by our model with those generated by the state-of-the-art model PG2 [16]. To obtain the results of PG2, we used the pretrained weights provided by the authors of PG2. As shown in Fig. 3, hair and clothes were heavily blurred in the results of PG2, so the appearance information was not preserved well. In contrast, the appearance was preserved well in ours. Fig. 4 shows the effectiveness of refining multi-level features using GGBs. '1 GGB' indicates the generative model with only the 6-th GGB. '2 GGBs' indicates the generative model with the 5-th and 6-th GGBs. '3 GGBs' indicates the generative model with the 4-th, 5-th and 6-th GGBs, the same as the proposed model. The more GGBs were used in training the generative model, the clearer the images were and the better the appearance was preserved. Tables 1 and 2 show the quantitative results of the state-of-the-art models [16, 17, 18, 19] and the proposed model, measured by Structural Similarity (SSIM) [25] and Inception Score (IS) [7]. As seen in Table 1, the proposed method outperformed the state-of-the-art methods. In Table 2, 'w/o GGBs' indicates training the generative model without any GGB. 'w/o RAPD' and 'w/o NVTD' indicate that the GGB contains only the NVTD and only the RAPD, respectively. As seen in Table 2, the proposed model (i.e., 3 GGBs, each with both RAPD and NVTD) provided the highest performance.
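For readers unfamiliar with SSIM, the following simplified sketch computes the SSIM formula from global image statistics. It is a simplification, not the evaluation code used in the paper: standard SSIM [25] averages the same expression over local windows, whereas this version uses one window covering the whole image. Images are assumed to be flattened lists of 8-bit pixel values, and the function name is hypothetical.

```python
def global_ssim(x, y, data_range=255.0):
    """SSIM formula evaluated once over the whole image (no sliding window)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Standard stabilizing constants with K1 = 0.01 and K2 = 0.03.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

same = global_ssim([10, 200, 30, 250], [10, 200, 30, 250])
inverted = global_ssim([0, 255, 0, 255], [255, 0, 255, 0])
```

Identical images give a score of 1, while the inverted pattern in the second call is anti-correlated and scores well below 1, which is why a higher SSIM in Tables 1 and 2 indicates closer structural agreement with the ground-truth target.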
4 CONCLUSION
In this paper, we proposed a novel Generative Guiding Block for synthesizing realistic-looking images with large variations while preserving the appearance properties. The proposed GGB consists of two critic networks: the RAPD for maintaining the appearance characteristics and the NVTD for applying the target variants. By hierarchically integrating the proposed GGBs with the generator, the GGBs enhance the generative features in the decoder from coarse to fine. The experimental results showed that the proposed method outperformed the state-of-the-art methods. The effectiveness of the components of the GGB (i.e., RAPD and NVTD) and of the hierarchical multi-level features was also shown.
References
- [1] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” CoRR, vol. abs/1312.6114, 2013.
- [2] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, pp. 2672–2680. Curran Associates, Inc., 2014.
- [3] A. Van Den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” in ICML, 2016, pp. 1747–1756.
- [4] R. A. Yeh∗, C. Chen∗, T. Y. Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do, “Semantic image inpainting with deep generative models,” in CVPR, 2017 (∗ equal contribution).
- [5] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras, “Neural face editing with intrinsic image disentangling,” in CVPR, IEEE, 2017.
- [6] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. Efros, “Context encoders: Feature learning by inpainting,” in CVPR, 2016.
- [7] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” in NIPS, 2016, pp. 2234–2242.
- [8] D. Yoo, S. Park Kim, N. Kim, A. S. Paek, and I. Kweon, “Pixel-level domain transfer,” in ECCV, Oct. 2016, vol. 9912, pp. 517–532.
