Generative Guiding Block: Synthesizing Realistic Looking Variants Capable of Even Large Change Demands

Minho Park; Hak Gu Kim; Yong Man Ro

arXiv:1907.01187·cs.CV·July 3, 2019

TL;DR

This paper introduces a novel generative guiding block for realistic image synthesis that effectively handles large variations and deformations, enhancing the quality and diversity of generated images.

Contribution

The paper proposes a new generative guiding block with dual discriminators to improve large-variation image synthesis, a novel approach compared to existing methods.

Findings

01. Enhanced image realism and variation handling demonstrated in experiments

02. Outperforms state-of-the-art methods in qualitative and quantitative evaluations

03. Effective preservation of appearance despite large transformations

Abstract

Realistic image synthesis aims to generate an image that is perceptually indistinguishable from an actual image. Generating realistic-looking images with large variations (e.g., large spatial deformations and large pose changes), however, is very challenging. Both handling large variations and preserving appearance need to be taken into account in realistic-looking image generation. In this paper, we propose a novel realistic-looking image synthesis method, especially for large change demands. To do that, we devise generative guiding blocks. The proposed generative guiding block includes a realistic appearance preserving discriminator and a naturalistic variation transforming discriminator. By incorporating the proposed generative guiding blocks into a generative model, the latent features at the layers of the generative model are enhanced to synthesize both realistic looking- and target variation-…

Tables

Table 1: Quantitative comparison with the state-of-the-art methods on the DeepFashion dataset.

| Model | SSIM | IS |
|---|---|---|
| Disentangled [17] | 0.614 | 3.23 |
| VariGAN [18] | 0.620 | 3.03 |
| PG² [16] | 0.762 | 3.09 |
| DPT [19] | 0.769 | 3.17 |
| Ours | 0.799 | 3.26 |

Table 2: Effectiveness of using both RAPD/NVTD and multiple GGBs.

| Model | SSIM | IS |
|---|---|---|
| Ours w/o GGBs | 0.705 | 2.81 |
| Ours w/o RAPD | 0.709 | 2.72 |
| Ours w/o NVTD | 0.714 | 2.73 |
| Ours with 1 GGB | 0.780 | 3.14 |
| Ours with 2 GGBs | 0.793 | 3.15 |
| Ours | 0.799 | 3.26 |

Equations

\begin{equation}
\mathcal{L}_{D_{RAPD}}^{n} = -\mathbb{E}_{\mathbf{y}\sim p_{\mathbf{y}}}[\log(D_{RAPD}^{n}(f(\mathbf{y}^{n})))] - \mathbb{E}_{\mathbf{x}\sim p_{\mathbf{x}}}[\log(1 - D_{RAPD}^{n}(f(\hat{\mathbf{x}}^{n})))], \tag{5}
\end{equation}

\begin{equation}
\mathcal{L}_{D_{NVTD}}^{n} = -\mathbb{E}_{\mathbf{y}\sim p_{\mathbf{y}}}[\log(D_{NVTD}^{n}(\mathbf{d}_{real}^{n}))] - \mathbb{E}_{\mathbf{x}\sim p_{\mathbf{x}}}[\log(1 - D_{NVTD}^{n}(\mathbf{d}_{fake}^{n}))], \tag{6}
\end{equation}

\begin{equation}
\ell_{RAPD}^{n} = -\mathbb{E}_{\mathbf{x}\sim p_{\mathbf{x}}}[\log(D_{RAPD}^{n}(f(\hat{\mathbf{x}}^{n})))], \tag{7}
\end{equation}

\begin{equation}
\ell_{NVTD}^{n} = -\mathbb{E}_{\mathbf{x}\sim p_{\mathbf{x}}}[\log(D_{NVTD}^{n}(\mathbf{d}_{fake}^{n}))]. \tag{8}
\end{equation}

\begin{equation}
\mathcal{L}_{GGB} = \sum_{n=1}^{N-1}\left(\lambda_{RAPD}^{n}\ell_{RAPD}^{n} + \lambda_{NVTD}^{n}\ell_{NVTD}^{n} + \ell_{rec}^{n}\right), \tag{9}
\end{equation}

Full text

Abstract

Realistic image synthesis aims to generate an image that is perceptually indistinguishable from an actual image. Generating realistic-looking images with large variations (e.g., large spatial deformations and large pose changes), however, is very challenging. Both handling large variations and preserving appearance need to be taken into account in realistic-looking image generation. In this paper, we propose a novel realistic-looking image synthesis method, especially for large change demands. To do that, we devise generative guiding blocks. The proposed generative guiding block includes a realistic appearance preserving discriminator and a naturalistic variation transforming discriminator. By incorporating the proposed generative guiding blocks into a generative model, the latent features at the layers of the generative model are enhanced to synthesize images that are both realistic-looking and exhibit the target variation. Through qualitative and quantitative evaluation in experiments, we demonstrate the effectiveness of the proposed generative guiding blocks compared to state-of-the-art methods.

**Index Terms:** Deep learning, adversarial learning, variation image synthesis, feature enhancement

1 INTRODUCTION

Generating realistic-looking images has drawn great attention and is considered an important task in generative models for image synthesis. Recently, deep learning-based generative models have achieved remarkable success in various synthesis tasks such as face, human, and scene generation. In data acquisition, it is time-consuming and costly to collect or capture images with desired variations (e.g., pose, illumination, facial expression, and viewpoint). Generative models that can automatically synthesize images with the desired variations are therefore needed in practice.

To generate realistic-looking images of objects, it is necessary to understand both their appearance and their variants. An object has inherent appearance properties characterized by color and texture, such as hair color and fashion style. On the other hand, there are variants, including the shape and geometrical layout of the object. One of the most challenging points in image generation is to preserve the appearance properties of the input image (e.g., color, texture, the identity of a person) while performing spatial deformation according to the variants (e.g., pose variation and illumination variation).

For this task, various methods have been proposed so far based on Variational Auto-Encoders (VAEs) [1], Generative Adversarial Networks (GANs) [2], and Autoregressive models (ARMs) (e.g., PixelRNN [3]) [4, 5, 6, 7, 8, 9, 10, 11, 12]. Recently, a wide range of methods, including conditional GANs [13] and conditional VAEs [14], have been proposed for synthesizing images whose appearance depends on a given conditioning variable (e.g., a label). However, most of them cannot handle large variations (e.g., large spatial deformation [15]) between the input and the target image while preserving the appearance of the given input. Due to the high dimensionality of images and the complex configuration of image contents, it is difficult for a complete end-to-end framework to generate both the correct target variation and the detailed appearance simultaneously [16, 17, 18, 19].

In this paper, we focus on realistic appearance and naturalistic variation in target image generation. The generative features are enhanced with appearance preservation and variant transformation. Our objective is to propose a new generation method that addresses two problems: realistic appearance and naturalistic large variation. To cope with these problems, we propose novel generative guiding blocks (GGBs). Each generative guiding block consists of a realistic appearance preserving discriminator (RAPD) and a naturalistic variation transforming discriminator (NVTD). In the proposed RAPD, to preserve the object appearance of the input image (e.g., the identity of a person), the overall image distribution is considered by determining whether or not the appearance is preserved in the target image. Simultaneously, in the proposed NVTD, to generate the target image with large variation, the change information of the deformation is considered by focusing on the variation between the input and the generated target image. We hierarchically integrate the proposed GGBs with the decoding module of the generator to enhance the generative features at multiple resolution levels. The proposed generative model with GGBs can robustly synthesize realistic-looking images even with large variations while maintaining naturalistic variants. Experimental results showed the effectiveness of the proposed GGBs.

The rest of this paper is organized as follows. In Section 2, we describe the proposed generative model with GGBs. In Section 3, the experimental results are presented. Finally, conclusions are drawn in Section 4.

2 PROPOSED METHOD

Fig. 1 shows the proposed generative model with generative guiding blocks (GGBs). The generator synthesizes a fake image having the appearance of the input image and the target variants. The discriminator determines whether the fake image is real or not. As shown in Fig. 1, the GGBs are attached to the multi-level generative features of multiple layers in the decoder of the generator. The GGBs determine whether the generated multi-resolution images have realistic appearance (handled by the RAPD in each GGB) and naturalistic variation (handled by the NVTD in each GGB). Variant transformation is performed hierarchically in a multi-resolution manner so that the proposed generator can handle large variant demands. In the following subsections, we describe the generator, the discriminator, and the GGBs in detail.

2.1 Generative model with discriminator

Let $\mathbf{x}\in\mathbb{R}^{256\times 256\times 3}$ denote the input image and $\mathbf{y}\in\mathbb{R}^{256\times 256\times 3}$ denote the ground-truth target image. $c$ denotes the target variation and $\hat{\mathbf{x}}\in\mathbb{R}^{256\times 256\times 3}$ (i.e., $G(\mathbf{x},c)$) denotes the generated image. Let $\mathbf{g}^{n}$ denote the $n$-th generative feature. Let $G$ denote the generator, $D$ denote the discriminator, and $\mathbf{M}_{c}\in\mathbb{R}^{256\times 256\times 3}$ denote the label map encoded from $c$. By encoding $c$, abundant condition information about the desired variation is provided to $G$. In this paper, a U-Net-like structure is employed as $G$ [20, 21]. The encoder and decoder of $G$ consist of 7 convolution layers and 7 deconvolution layers, respectively (i.e., $N=7$), with $4\times 4$ kernels and a stride of 2. $D$ consists of 5 convolution layers with $4\times 4$ kernels and a stride of 2.
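The layer counts above imply a ladder of feature resolutions, which can be checked with a short sketch. It assumes a padding of 1 for every layer (the text gives only the kernel size and stride), and the helper names `conv_out`, `deconv_out`, and `resolution_ladder` are illustrative rather than taken from the paper.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Side length after a strided convolution (padding=1 is an assumption)."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel=4, stride=2, padding=1):
    """Side length after a transposed convolution with no output padding."""
    return (size - 1) * stride - 2 * padding + kernel


def resolution_ladder(input_size=256, num_layers=7):
    """Return encoder and decoder side lengths for N stride-2 layers each."""
    encoder, decoder = [], []
    size = input_size
    for _ in range(num_layers):
        size = conv_out(size)
        encoder.append(size)
    for _ in range(num_layers):
        size = deconv_out(size)
        decoder.append(size)
    return encoder, decoder


encoder, decoder = resolution_ladder()
print(encoder)  # [128, 64, 32, 16, 8, 4, 2]
print(decoder)  # [4, 8, 16, 32, 64, 128, 256]
```

Under this assumption the $n$-th decoder feature has side length $2^{n+1}$, so $\mathbf{g}^{4}$, $\mathbf{g}^{5}$, and $\mathbf{g}^{6}$ sit at 32, 64, and 128 pixels, which agrees with the resolutions reported in Section 3.2.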

With adversarial learning [2], $D$ determines whether $\hat{\mathbf{x}}$ is realistic-looking or not, comparing it with $\mathbf{y}$. The objective function of $D$ can be written as

\begin{equation}
\mathcal{L}_{D} = -\mathbb{E}_{\mathbf{y}\sim p_{\mathbf{y}}}[\log(D(\mathbf{y}))] - \mathbb{E}_{\mathbf{x}\sim p_{\mathbf{x}}}[\log(1-D(G(\mathbf{x},c)))]. \tag{1}
\end{equation}

On the other hand, $G$ tries to fool $D$ by generating a realistic image. To that end, the loss of the generator is composed of two terms: the realism loss, $\ell_{real}$, and the reconstruction loss, $\ell_{rec}$. The realism loss can be written as

\begin{equation}
\ell_{real} = -\mathbb{E}_{\mathbf{x}\sim p_{\mathbf{x}}}[\log(D(G(\mathbf{x},c)))]. \tag{2}
\end{equation}

The reconstruction loss between the ground-truth target image and the generated image at the $n$-th level of the decoder, $\ell^{n}_{rec}$, can be written as

\begin{equation}
\ell^{n}_{rec} = \mathbb{E}_{\mathbf{x}\sim p_{\mathbf{x}}}[\|\mathbf{y}^{n}-\hat{\mathbf{x}}^{n}\|_{1}], \tag{3}
\end{equation}

where $\hat{\mathbf{x}}^{n}$ indicates an image generated from $\mathbf{g}^{n}$ and $\mathbf{y}^{n}$ indicates an image downsized from $\mathbf{y}$ to the same resolution as $\hat{\mathbf{x}}^{n}$ (as shown in Fig. 2).

Finally, the total loss function of the proposed generator, $G$, can be defined as a combination of the realism loss and the reconstruction loss:

\begin{equation}
\mathcal{L}_{G} = \lambda_{real}\ell_{real} + \ell^{N}_{rec}, \tag{4}
\end{equation}

where $\lambda_{real}$ is a weight parameter that controls the balance between $\ell_{real}$ and $\ell^{N}_{rec}$.

Fig. 2: The architecture of the proposed $n$-th GGB.
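To make the four losses of this subsection concrete, the following is a minimal sketch that evaluates them on plain Python lists. It assumes discriminator outputs are probabilities in (0, 1), that expectations are approximated by batch means, and that images are flattened lists of pixel values; the function names are hypothetical and no network is involved.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Eq. 1: -E[log D(y)] - E[log(1 - D(G(x, c)))], as batch means."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def realism_loss(d_fake):
    """Eq. 2: -E[log D(G(x, c))], small when D is fooled."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


def l1_reconstruction_loss(targets, generated):
    """Eq. 3: L1 norm between each target and generated image, batch mean."""
    per_image = [
        sum(abs(t - g) for t, g in zip(y, x_hat))
        for y, x_hat in zip(targets, generated)
    ]
    return sum(per_image) / len(per_image)


def generator_loss(d_fake, targets, generated, lambda_real=0.02):
    """Eq. 4: lambda_real * l_real + l_rec at the final level N."""
    return (lambda_real * realism_loss(d_fake)
            + l1_reconstruction_loss(targets, generated))


# Toy batch of one image with two pixels; D is fully fooled (score 1.0).
print(generator_loss([1.0], [[1.0, 2.0]], [[0.0, 4.0]]))  # 3.0
```

The default `lambda_real=0.02` is the value reported in Section 3.2; with a fully fooled discriminator the realism term vanishes and only the L1 term remains.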

2.2 Generative Guiding Block for realistic appearance and naturalistic variation

Fig. 2 shows the architecture of the proposed $n$-th GGB, which consists of a realistic appearance preserving discriminator (RAPD), $D_{RAPD}$, and a naturalistic variation transforming discriminator (NVTD), $D_{NVTD}$. The GGBs are attached to the multi-level generative features of multiple layers in the decoder, as shown in Fig. 1. Let $\mathbf{x}^{n}$ denote an image downsized from $\mathbf{x}$ to the same resolution as $\hat{\mathbf{x}}^{n}$. Let $f(\cdot)$ denote the feature encoder. In this paper, $D_{RAPD}$ and $D_{NVTD}$ each consist of 3 convolution layers. The feature encoder consists of 2 convolution layers with $4\times 4$ kernels and a stride of 2.

First, to deal with the feature information of $\mathbf{x}^{n}$, $\hat{\mathbf{x}}^{n}$, and $\mathbf{y}^{n}$, the images are encoded into the latent features $f(\mathbf{x}^{n})$, $f(\hat{\mathbf{x}}^{n})$, and $f(\mathbf{y}^{n})$. After that, $D_{RAPD}$ distinguishes whether the encoded features, $f(\hat{\mathbf{x}}^{n})$ and $f(\mathbf{y}^{n})$, are realistic or not. As shown in Fig. 2, $D_{NVTD}$ distinguishes whether the residual information of the encoded features (i.e., $\mathbf{d}_{real}^{n}=f(\mathbf{x}^{n})-f(\mathbf{y}^{n})$ and $\mathbf{d}_{fake}^{n}=f(\mathbf{x}^{n})-f(\hat{\mathbf{x}}^{n})$) is realistic or not. The input of $D_{NVTD}$ is residual information so that $D_{NVTD}$ focuses only on the target variation. $G$ tries to fool $D_{RAPD}$, so that $\hat{\mathbf{x}}^{n}$ mimics the data distribution of $\mathbf{y}^{n}$. Through this process, $\mathbf{g}^{n}$ is enhanced for generating images with realistic appearance. Likewise, $G$ tries to fool $D_{NVTD}$, so that $\mathbf{d}_{fake}^{n}$ follows $\mathbf{d}_{real}^{n}$. In this way, $\mathbf{g}^{n}$ is also enhanced for generating images with naturalistic variation.
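The residual inputs of the NVTD can be illustrated with a small sketch. The encoder here is a hypothetical stand-in (`toy_encoder` averages neighbouring values) used only so the example runs; in the paper $f(\cdot)$ is a learned two-layer convolutional encoder, and the names `residual` and `nvtd_inputs` are likewise illustrative.

```python
def toy_encoder(image):
    """Hypothetical stand-in for f(.): averages neighbouring pairs of values."""
    return [(image[i] + image[i + 1]) / 2.0 for i in range(0, len(image) - 1, 2)]


def residual(feat_a, feat_b):
    """Element-wise difference between two feature vectors."""
    return [a - b for a, b in zip(feat_a, feat_b)]


def nvtd_inputs(x_n, y_n, x_hat_n, f=toy_encoder):
    """Build d_real = f(x^n) - f(y^n) and d_fake = f(x^n) - f(x_hat^n)."""
    f_x = f(x_n)
    d_real = residual(f_x, f(y_n))
    d_fake = residual(f_x, f(x_hat_n))
    return d_real, d_fake


x_n = [0.0, 0.0, 1.0, 1.0]      # input image at level n (flattened)
y_n = [1.0, 1.0, 0.0, 0.0]      # ground-truth target at level n
x_hat_n = [1.0, 1.0, 0.0, 0.0]  # a perfect generation equals the target

d_real, d_fake = nvtd_inputs(x_n, y_n, x_hat_n)
print(d_real)  # [-1.0, 1.0]
print(d_fake)  # [-1.0, 1.0]
```

When the generated image matches the target, the two residuals coincide, which is the state the generator is pushed toward. Because both residuals subtract from the same $f(\mathbf{x}^{n})$, appearance shared between input and output largely cancels and the change itself is what remains.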

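The discriminators in the GGB, $D_{RAPD}$ and $D_{NVTD}$, are trained by adversarial learning with $G$. Therefore, we adopt the generative adversarial loss. First, the objective function of $D_{RAPD}$ is defined as

\begin{equation}
\mathcal{L}_{D_{RAPD}}^{n} = -\mathbb{E}_{\mathbf{y}\sim p_{\mathbf{y}}}[\log(D_{RAPD}^{n}(f(\mathbf{y}^{n})))] - \mathbb{E}_{\mathbf{x}\sim p_{\mathbf{x}}}[\log(1 - D_{RAPD}^{n}(f(\hat{\mathbf{x}}^{n})))], \tag{5}
\end{equation}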

where ${D_{RAPD}^{n}}$ indicates $D_{RAPD}$ in $n$ -th GGB. Similarly, the objective function of $D_{NVTD}$ is defined as

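\begin{equation}
\mathcal{L}_{D_{NVTD}}^{n} = -\mathbb{E}_{\mathbf{y}\sim p_{\mathbf{y}}}[\log(D_{NVTD}^{n}(\mathbf{d}_{real}^{n}))] - \mathbb{E}_{\mathbf{x}\sim p_{\mathbf{x}}}[\log(1 - D_{NVTD}^{n}(\mathbf{d}_{fake}^{n}))], \tag{6}
\end{equation}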

where ${D_{NVTD}^{n}}$ indicates $D_{NVTD}$ in $n$ -th GGB.

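$D_{RAPD}^{n}$ and $D_{NVTD}^{n}$ are trained to minimize $\mathcal{L}_{D_{RAPD}}^{n}$ and $\mathcal{L}_{D_{NVTD}}^{n}$, respectively. Conversely, $G$ with GGBs is trained to minimize $\ell_{RAPD}^{n}$ and $\ell_{NVTD}^{n}$ in order to learn to fool $D_{RAPD}^{n}$ and $D_{NVTD}^{n}$. These objective functions can be written as

\begin{equation}
\ell_{RAPD}^{n} = -\mathbb{E}_{\mathbf{x}\sim p_{\mathbf{x}}}[\log(D_{RAPD}^{n}(f(\hat{\mathbf{x}}^{n})))], \tag{7}
\end{equation}

\begin{equation}
\ell_{NVTD}^{n} = -\mathbb{E}_{\mathbf{x}\sim p_{\mathbf{x}}}[\log(D_{NVTD}^{n}(\mathbf{d}_{fake}^{n}))]. \tag{8}
\end{equation}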

In particular, to preserve the appearance information, we adopt the L1 norm as our reconstruction loss, Eq. 3. Finally, the objective function of G with our GGBs is defined as

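\begin{equation}
\mathcal{L}_{GGB} = \sum_{n=1}^{N-1}\left(\lambda_{RAPD}^{n}\ell_{RAPD}^{n} + \lambda_{NVTD}^{n}\ell_{NVTD}^{n} + \ell_{rec}^{n}\right), \tag{9}
\end{equation}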

where $\Sigma$ is used for weighted sum of multi-level GGB losses.
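The weighted sum over levels can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: discriminator scores are probabilities in (0, 1), expectations are batch means, the per-level reconstruction loss is supplied precomputed, and the dictionary keys and function names are hypothetical. The default weights of 0.01 are the values reported in Section 3.2.

```python
import math


def fool_loss(d_scores):
    """Shared form of the generator-side terms: -E[log D(.)] as a batch mean."""
    return -sum(math.log(p) for p in d_scores) / len(d_scores)


def ggb_loss(levels, lambda_rapd=0.01, lambda_nvtd=0.01):
    """Sum the weighted RAPD and NVTD terms plus the L1 term over all levels."""
    total = 0.0
    for level in levels:
        total += (lambda_rapd * fool_loss(level["rapd_scores"])
                  + lambda_nvtd * fool_loss(level["nvtd_scores"])
                  + level["rec_loss"])
    return total


# Two levels where both discriminators are fully fooled (score 1.0),
# so only the reconstruction terms contribute.
levels = [
    {"rapd_scores": [1.0], "nvtd_scores": [1.0], "rec_loss": 0.5},
    {"rapd_scores": [1.0], "nvtd_scores": [1.0], "rec_loss": 0.25},
]
print(ggb_loss(levels))  # 0.75
```

With weights this small, the reconstruction term dominates the sum unless the discriminators reject the generated features strongly, so the adversarial terms act as a guiding signal on top of the L1 objective.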

2.3 Training strategy

At every iteration, $\mathbf{x}$ and $c$ are given to $G$. Then, $G$ generates $\hat{\mathbf{x}}$. In $D$, $\mathcal{L}_{D}$ is calculated with $\hat{\mathbf{x}}$ and $\mathbf{y}$ (see Eq. 1). In the $n$-th GGB, $\mathcal{L}_{D_{RAPD}}^{n}$ and $\mathcal{L}_{D_{NVTD}}^{n}$ are calculated with $\mathbf{x}^{n}$, $\mathbf{y}^{n}$, and $\hat{\mathbf{x}}^{n}$ (see Eqs. 5 and 6). After that, the weights of $D$ are updated to minimize $\mathcal{L}_{D}$. Also, the weights of the $n$-th GGB are updated to minimize $\mathcal{L}_{D_{RAPD}}^{n}$ and $\mathcal{L}_{D_{NVTD}}^{n}$ ($n=1,2,\ldots,N-1$). The weights of $G$ except for $\mathbf{g}^{N}$ are first updated to minimize $\mathcal{L}_{GGB}$ (see Eq. 9). Finally, the weights of $G$ are updated to minimize $\mathcal{L}_{G}$ (see Eq. 4). This process is repeated until the weights are optimized.
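The update order within one iteration can be laid out as a short sketch. It only records the sequence of steps described above and performs no actual optimization; the function name and step strings are illustrative, and the default levels (4, 5, 6) are the ones used in the experiments of Section 3.2.

```python
def training_iteration(ggb_levels=(4, 5, 6)):
    """Return the ordered update steps of one training iteration."""
    steps = ["forward: x_hat = G(x, c)"]
    # Discriminator-side updates come first.
    steps.append("update D to minimize L_D (Eq. 1)")
    for n in ggb_levels:
        steps.append(
            f"update GGB {n}: D_RAPD and D_NVTD minimize Eqs. 5 and 6"
        )
    # Generator-side updates: guided intermediate layers, then the whole G.
    steps.append("update G except g^N to minimize L_GGB (Eq. 9)")
    steps.append("update G to minimize L_G (Eq. 4)")
    return steps


for step in training_iteration():
    print(step)
```

The point of the ordering is that the intermediate decoder features are refined by the GGB objective before the final full-resolution objective is applied to the whole generator.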

3 EXPERIMENTS AND RESULTS

3.1 Datasets

To verify the effectiveness of the proposed generative model with GGBs, we used a public dataset, DeepFashion [22]. This dataset consists of 52,712 in-shop clothes images with 256 $\times$ 256 resolution. Similar to [16], we have 146,680 pairs for the training set. Each pair is composed of two images of the same identity but with different poses. For testing, we randomly selected 12,800 pairs from the test set. To use the human pose landmarks of the DeepFashion data as the target variation, we applied a state-of-the-art pose estimator [23], as in [16].

3.2 Implementation details

We used the Adam optimizer [24] with $\beta_{1} = 0.5$, $\beta_{2} = 0.999$, a batch size of 8, and a learning rate of 0.0002 to train the proposed models. In our experiments, we attached three GGBs to the generative features with 32 $\times$ 32, 64 $\times$ 64, and 128 $\times$ 128 resolutions (i.e., $\mathbf{g}^{4}$, $\mathbf{g}^{5}$, and $\mathbf{g}^{6}$). We empirically set $\lambda_{real} = 0.02$ and $\lambda_{RAPD}^{n} = \lambda_{NVTD}^{n} = 0.01$.
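For reference, a single Adam update with the hyperparameters above can be written out on a scalar parameter. This is a sketch of the standard Adam rule from [24], not code from the paper; the epsilon of 1e-8 is an assumed default, since the text does not state it, and `adam_step` is a hypothetical helper name.

```python
import math


def adam_step(param, grad, m, v, t,
              lr=0.0002, beta1=0.5, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


param, m, v = adam_step(param=1.0, grad=3.0, m=0.0, v=0.0, t=1)
print(param)  # close to 1.0 - 0.0002, since the first step has magnitude ~lr
```

On the first step the bias-corrected ratio is approximately the sign of the gradient, so the parameter moves by about one learning rate regardless of the gradient's scale.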

3.3 Performance evaluation

Fig. 3 compares images generated by our model with those generated by the state-of-the-art model PG² [16]. To obtain the results of PG², we used the pretrained weights provided by the authors of PG². As shown in Fig. 3, hair and clothes were heavily blurred in the results of PG², so the appearance information was not preserved well. In contrast, the appearances were preserved well in ours. Fig. 4 shows the effectiveness of refining multi-level features using GGBs. '1 GGB' indicates the generative model with only the 6-th GGB. '2 GGBs' indicates the generative model with the 5-th and 6-th GGBs. '3 GGBs' indicates the generative model with the 4-th, 5-th, and 6-th GGBs, the same as the proposed model. The more GGBs were used in training the generative model, the clearer the images and the better the appearance was preserved. Tables 1 and 2 show the quantitative results of the state-of-the-art models [16, 17, 18, 19] and the proposed model, measured by Structural Similarity (SSIM) [25] and Inception Score (IS) [7]. As seen in Table 1, the proposed method outperformed the state-of-the-art methods on both metrics (SSIM of 0.799 versus 0.769 for DPT [19]; IS of 3.26 versus 3.23 for Disentangled [17]). In Table 2, 'w/o GGBs' indicates training the generative model without any GGB. 'w/o RAPD' and 'w/o NVTD' indicate that each GGB contains only the NVTD and only the RAPD, respectively. As seen in Table 2, the proposed model (i.e., 3 GGBs, each with both RAPD and NVTD) provided the highest performance.
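As a reminder of what the SSIM column measures, here is a simplified sketch that computes the SSIM formula once over a whole image. This is an assumption-laden simplification: the metric in [25] is evaluated over local windows and averaged, so values from this global version will not match the numbers in Tables 1 and 2. The constants 0.01 and 0.03 are the commonly used defaults, and `global_ssim` is a hypothetical helper name.

```python
def global_ssim(a, b, dynamic_range=255.0):
    """Single-window SSIM over two flattened images of equal length."""
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((p - mu_a) ** 2 for p in a) / n
    var_b = sum((q - mu_b) ** 2 for q in b) / n
    cov = sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b)) / n
    c1 = (0.01 * dynamic_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * dynamic_range) ** 2  # stabilizes the contrast/structure term
    numerator = (2.0 * mu_a * mu_b + c1) * (2.0 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


image = [0.0, 64.0, 128.0, 255.0]
same = global_ssim(image, image)
flipped = global_ssim(image, image[::-1])
print(same > flipped)  # True: an identical image scores higher than a reversed one
```

An image compared with itself scores 1 (up to rounding), while structural disagreement lowers the covariance term and therefore the score, which is why SSIM rewards preserved appearance and layout.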

4 CONCLUSION

In this paper, we proposed a novel Generative Guiding Block for synthesizing realistic-looking images with large variations while preserving the appearance properties. The proposed GGB consists of two critic networks: the RAPD for maintaining the appearance characteristics and the NVTD for applying the target variants. By hierarchically integrating the proposed GGBs with the generator, the generative features in the decoder are enhanced from coarse to fine. The experimental results showed that the proposed method outperformed the state-of-the-art methods. The effectiveness of the components of the GGB (i.e., RAPD and NVTD) and of the hierarchical multi-level features was also shown.

Bibliography

[1] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” CoRR, vol. abs/1312.6114, 2013.
[2] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, pp. 2672–2680. Curran Associates, Inc., 2014.
[3] A. Van Den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” in ICML, 2016, pp. 1747–1756.
[4] R. A. Yeh∗, C. Chen∗, T. Y. Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do, “Semantic image inpainting with deep generative models,” in CVPR, 2017, ∗ equal contribution.
[5] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras, “Neural face editing with intrinsic image disentangling,” in CVPR. IEEE, 2017.
[6] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. Efros, “Context encoders: Feature learning by inpainting,” in CVPR, 2016.
[7] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” in NIPS, 2016, pp. 2234–2242.
[8] D. Yoo, S. Park Kim, N. Kim, A. S. Paek, and I. Kweon, “Pixel-level domain transfer,” in ECCV, Oct. 2016, vol. 9912, pp. 517–532.