Semi-Supervised Image-to-Image Translation

Manan Oza; Himanshu Vaghela; Sudhir Bagul

arXiv:1901.08212·cs.CV·January 25, 2019

TL;DR

This paper introduces a semi-supervised, adversarial neural network model for image-to-image translation that preserves content and realism without relying on segmentation or extensive supervision.

Contribution

The proposed model is a semi-supervised, GAN-based approach that is independent of segmentation and content/style features, improving realism and content preservation in image translation.

Findings

01 Outperforms Multimodal Unsupervised Image-to-Image Translation

02 Produces more realistic and content-preserving translations

03 Operates independently of image segmentation

Equations

$$\mathcal{L}_{m}=\sum_{c=1}^{3}V_{c}[O]^{T}\,\mathcal{M}_{I}\,V_{c}[O] \tag{1}$$

$$(c_{i},s_{i})=(E_{i}^{c}(x_{i}),E_{i}^{s}(x_{i}))=E_{i}(x_{i}) \tag{2}$$

$$x_{1\longrightarrow 2}=G_{2}(c_{1},s_{2}) \tag{3}$$

$$\mathcal{L}_{recon}^{x_{1}}=\mathbb{E}_{x_{1}\sim p(x_{1})}\left[\lVert G_{1}(E_{1}(x_{1}))-x_{1}\rVert_{1}\right] \tag{4}$$

$$\mathcal{L}_{recon}^{c_{1}}=\mathbb{E}_{c_{1}\sim p(c_{1}),\,s_{2}\sim p(s_{2})}\left[\lVert E_{2}^{c}(G_{2}(c_{1},s_{2}))-c_{1}\rVert_{1}\right] \tag{5}$$

$$\mathcal{L}_{recon}^{s_{2}}=\mathbb{E}_{c_{1}\sim p(c_{1}),\,s_{2}\sim p(s_{2})}\left[\lVert E_{2}^{s}(G_{2}(c_{1},s_{2}))-s_{2}\rVert_{1}\right] \tag{6}$$

$$\mathcal{L}_{GAN}^{x_{2}}=\mathbb{E}_{c_{1}\sim p(c_{1}),\,s_{2}\sim p(s_{2})}\left[\log\left(1-D_{2}(G_{2}(c_{1},s_{2}))\right)\right]+\mathbb{E}_{x_{2}\sim p(x_{2})}\left[\log D_{2}(x_{2})\right] \tag{7}$$

$$\begin{aligned}\min_{E_{1},E_{2},G_{1},G_{2}}\ \max_{D_{1},D_{2}}\ \mathcal{L}(E_{1},E_{2},G_{1},G_{2},D_{1},D_{2})={}&\mathcal{L}_{GAN}^{x_{1}}+\mathcal{L}_{GAN}^{x_{2}}+\lambda_{x}(\mathcal{L}_{recon}^{x_{1}}+\mathcal{L}_{recon}^{x_{2}})\\&+\lambda_{c}(\mathcal{L}_{recon}^{c_{1}}+\mathcal{L}_{recon}^{c_{2}})+\lambda_{s}(\mathcal{L}_{recon}^{s_{1}}+\mathcal{L}_{recon}^{s_{2}})\\&+\lambda_{A}(\mathcal{L}_{m}^{x_{1}}+\mathcal{L}_{m}^{x_{2}})\end{aligned} \tag{8}$$

$$p(c_{1})=p(c_{2}) \tag{9}$$

$$p(s_{1})=q(s_{1}) \tag{10}$$

$$p(s_{2})=q(s_{2}) \tag{11}$$

$$p(x_{1},x_{1\longrightarrow 2})+p(c_{1})=p(x_{2\longrightarrow 1},x_{2})+p(c_{2}) \tag{12}$$

Full text

Manan Oza

Department of Computer Engineering

D. J. Sanghvi College of Engineering

Mumbai, India

Himanshu Vaghela

Department of Computer Engineering

D. J. Sanghvi College of Engineering

Mumbai, India

Prof. Sudhir Bagul

Department of Computer Engineering

D. J. Sanghvi College of Engineering

Mumbai, India

Abstract

Image-to-image translation is a long-established and difficult problem in computer vision. In this paper we propose an adversarial model for image-to-image translation. Conventional deep neural network methods perform image-to-image translation by comparing Gram matrices and using image segmentation, which requires human intervention. Our generative adversarial network based model works on a conditional probability approach, which makes the image translation independent of any local, global, content or style features. We use a bidirectional reconstruction model extended with an affine transform factor that helps conserve content and photorealism compared to other models. The advantage of this approach is that the image-to-image translation is semi-supervised, independent of image segmentation, and inherits the tendency of generative adversarial networks to produce realistic images. This method has produced better results than Multimodal Unsupervised Image-to-Image Translation.

Index Terms:

GANs, image-to-image translation, style transfer

I Introduction

Image-to-image transfer has established itself as an important domain in computer vision since the first paper published by Gatys et al. [1]. Also known as Neural Style Transfer, it has seen many variations over the years, including image colorization [14], style transfer [1] and image-to-image transfer [3], most of which use deep neural networks with architectural variations. For instance, we can make a daytime image of a city (the content image) look like a night-time image by selecting an appropriate style (reference) image. Likewise, diverse types of features can be transferred from one image to another, including time of day, color and season.

Image-to-image translation is the process of translating one image onto another while preserving the content and photorealism of the original content image. Deep-learning techniques have proved excellent at faithful and photorealistic style translation [1, 2, 3, 7]. Our approach builds on generative adversarial networks (GANs), introduced by Goodfellow et al. [8]. A GAN consists of a generator and a discriminator. The discriminator is trained to identify real images, while the generator tries to fool it by creating counterfeit images from noise and passing them to the discriminator, which returns a verdict on how close the counterfeit images are to real ones. Based on this feedback the generator improves and creates another image, and the cycle repeats.

In this paper we use a modified GAN architecture whose final loss function is extended with an affine loss factor calculated from a Matting Laplacian matrix [6]. This additional factor helps maintain spatial integrity and preserve the photorealism of the content image. Because generative adversarial networks create images from noise, they are prone to distortions and noisy outputs, but they offer one major advantage: they do not reduce to a simple color and style mapping. Instead, they recreate the content image with the style variations.

II Related Work

Image-to-image style transfer methods have reached state-of-the-art results [2, 3, 7]. Existing algorithms fall into two broad classes: local translation and global translation. Neither class excels at both photorealism and faithful style translation at the same time and for all test cases; one factor or the other is compromised. Global stylization methods work by matching statistics of the pixel values [11], whereas local stylization is achieved by algorithms that find close and consistent relations between pixel values of the content and style images. Another classification is based on an algorithm's ability to translate low-level and/or high-level features. Low-level feature translation preserves the intricacies of the content image while modifying color or position with respect to the style image, whereas high-level feature translation maps broader features, for example day-to-night or summer-to-winter translations.

The best-performing works, proposed by Luan et al. [2] and Li et al. [7], are based on matching Gram matrices and make use of semantic segmentations of the content and style images. They take only the content and style images as network inputs. These algorithms also perform post-processing such as affine smoothing, which drastically improves the quality of the resulting images. Such methods derive segmented images from the content and style images and then perform style translation from one segment to another by comparing the Gram matrices of the input images. Other algorithms based on a similar paradigm are proposed by Gatys et al. [1], Huang et al. [2] and many others [1, 7, 12].
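To make the Gram-matrix comparison concrete, here is a minimal sketch in plain Python. The feature values are made up for illustration, and the helper names `gram_matrix` and `gram_distance` are hypothetical, not taken from any of the cited implementations. Each row of `features` stands for one channel of a feature map, flattened over pixels.

```python
def gram_matrix(features):
    """Return the C x C matrix of inner products between channel vectors."""
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) for row_j in features]
        for row_i in features
    ]


def gram_distance(features_a, features_b):
    """Squared difference between two Gram matrices (a simple style distance)."""
    gram_a = gram_matrix(features_a)
    gram_b = gram_matrix(features_b)
    return sum(
        (x - y) ** 2
        for row_a, row_b in zip(gram_a, gram_b)
        for x, y in zip(row_a, row_b)
    )


# Two channels, two pixels each (toy values).
features = [[1.0, 2.0], [3.0, 4.0]]
# Same responses with the two pixels swapped in every channel.
shuffled = [[2.0, 1.0], [4.0, 3.0]]

gram = gram_matrix(features)
distance = gram_distance(features, shuffled)
```

Because the Gram matrix sums over pixel positions, swapping pixels consistently across channels leaves it unchanged. This is one way to see why Gram-based methods need extra spatial guidance such as segmentation.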

Promising results have been shown by various GAN architectures, namely Pix2pix by Isola et al. [13], Unsupervised Image-to-Image Translation by Liu et al. [4], and CycleGAN and BicycleGAN by Zhu et al. [5], all of which take in a dataset consisting of multiple images from the content and style domains. Multimodal Unsupervised Image-to-Image Translation by Huang et al. [3] narrows the content domain down to a single image together with a number of style images, which constitute the style latent code [3]. To make the translation unsupervised, they propose decomposing the style images into a common style latent space. The content space is sampled from this style space based on a conditional distribution to perform the translation.

In our proposal we narrow the method down to one content image and one style image. Because there is only one target style image, the method is not completely unsupervised. We use the same architecture as Huang et al. [3], with an additional affine loss factor in the loss function that improves smoothness and the faithfulness of the style transfer, combined with the properties of generative adversarial networks.

III Methodology

In addition to the model proposed by Huang et al. [3], we add the local affine transform $\mathcal{L}_{m}$, also known as the photorealism factor of the content image, calculated from the Matting Laplacian matrix proposed by Levin et al. [6].

III-A Assumptions

All assumptions are exactly the same as those made in Multimodal Unsupervised Image-to-Image Translation by Huang et al. [3], and are as follows. The model assumes that the content and style images come from distinct image spaces, $x_{i}\in\mathcal{X}_{i}$, where $x_{i}$ is the $i^{th}$ image and $\mathcal{X}_{i}$ is its corresponding image space. Our goal is to estimate the conditional distributions $p(x_{1}|x_{2})$ and $p(x_{2}|x_{1})$, leading to the learned translation models $p(x_{1\longrightarrow 2}|x_{2})$ and $p(x_{2\longrightarrow 1}|x_{1})$ respectively, given that $p(x_{1})$ and $p(x_{2})$ are the marginal distributions of $x_{1}$ and $x_{2}$ respectively.

We further assume that every image $x_{i}\in\mathcal{X}_{i}$ from the dataset is composed of a content latent code $c\in\mathcal{C}$ and a style latent code $s_{i}\in\mathcal{S}_{i}$. Thus two images $(x_{1},x_{2})$ are generated by the individual generators as $x_{1}=G_{1}^{*}(c,s_{2})$ and $x_{2}=G_{2}^{*}(c,s_{1})$. $G_{1}^{*}$ and $G_{2}^{*}$ are generator functions, and $E_{1}^{*}$ and $E_{2}^{*}$ are their inverse encoders, with $E_{1}^{*}=(G_{1}^{*})^{-1}$ and $E_{2}^{*}=(G_{2}^{*})^{-1}$. Our aim is therefore to learn the encoder and generator functions using neural networks.

III-B Matting Laplacian

Image matting is the process of extracting the foreground and the background from an image with as little user intervention as possible. The Matting Laplacian [6] process produces an alpha matte, which is the segmented image with the foreground object in white and the background in black, or vice versa as required. Using this Matting Laplacian matrix we calculate the local affine transform factor $\mathcal{L}_{m}$, also known as the photorealism factor:

$$\mathcal{L}_{m}=\sum_{c=1}^{3}V_{c}[O]^{T}\,\mathcal{M}_{I}\,V_{c}[O] \tag{1}$$

It is the sum of the affine losses over the three channels of the image. $\mathcal{M}_{I}$ is the least-squares penalty function, which depends on the input image $I$. The $\mathcal{M}_{I}$ matrix has dimensions $N\times N$, and $V_{c}[O]$ is the vectorized form of the input image $O$ in channel $c$, with dimensions $N\times 1$. This factor is crucial for preserving photorealism and the content image in our proposal.
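The quadratic form behind the photorealism factor can be sketched in a few lines of plain Python. This is only an illustration of the sum of $V_{c}[O]^{T}\mathcal{M}_{I}V_{c}[O]$ over channels: the 3 $\times$ 3 matrix below is the Laplacian of a three-pixel path graph, used as an assumed stand-in for a real Matting Laplacian, and `affine_loss` is a hypothetical helper name.

```python
def affine_loss(channels, laplacian):
    """Sum over channels of v^T M v, where v is a flattened channel."""
    total = 0.0
    for v in channels:
        # Matrix-vector product M v, one row at a time.
        mv = [sum(m * x for m, x in zip(row, v)) for row in laplacian]
        # Inner product v^T (M v).
        total += sum(x * y for x, y in zip(v, mv))
    return total


# Stand-in for the N x N matrix M_I with N = 3 pixels (assumption, not a
# real Matting Laplacian computed from an image).
toy_laplacian = [
    [1.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0],
    [0.0, -1.0, 1.0],
]

flat_image = [[0.5, 0.5, 0.5]] * 3   # three identical, constant channels
spiky_image = [[0.0, 1.0, 0.0]] * 3  # an isolated bright pixel in each channel

flat_loss = affine_loss(flat_image, toy_laplacian)
spiky_loss = affine_loss(spiky_image, toy_laplacian)
```

With this stand-in matrix the constant image incurs no penalty while the isolated spike does, which matches the intuition that the term discourages local distortions.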

III-C Model

Our model, shown in figure 3, consists of an encoder $E_{i}^{*}$ and a decoder $G_{i}^{*}$ for every domain $\mathcal{X}_{i}$; in our case $i=1,2$. Each encoder is factorized into a content part and a style part, which produce the content and style latent codes $c_{i}$ and $s_{i}$:

$$(c_{i},s_{i})=(E_{i}^{c}(x_{i}),E_{i}^{s}(x_{i}))=E_{i}(x_{i}) \tag{2}$$

Thus for image-to-image translation we swap the encoder-decoder pairs. For the translation $x_{1\longrightarrow 2}$ we use the content code $c_{1}=E_{1}^{c}(x_{1})$ and a style latent code drawn at random from $s_{2}$, and then use the decoder $G_{2}$ to generate the image:

$$x_{1\longrightarrow 2}=G_{2}(c_{1},s_{2}) \tag{3}$$
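The swap of content and style codes can be illustrated with a deliberately simple toy in plain Python. Everything here is an assumption made for illustration: `encode` treats the mean brightness as the "style" and the deviation from the mean as the "content", and `decode` adds them back together. The actual model uses learned convolutional encoders and decoders, and draws the style code at random rather than encoding it from $x_{2}$.

```python
def encode(image):
    """Toy encoder: style is the mean value, content is the deviation from it."""
    style = sum(image) / len(image)
    content = [pixel - style for pixel in image]
    return content, style


def decode(content, style):
    """Toy decoder: recombine a content code with a style code."""
    return [c + style for c in content]


x1 = [0.25, 0.5, 0.75]  # toy "content image"
x2 = [0.0, 0.0, 0.75]   # toy "style image"

c1, s1 = encode(x1)
c2, s2 = encode(x2)

# Translation 1 -> 2: content of x1 combined with the style of domain 2.
x1_to_2 = decode(c1, s2)
# Within-domain reconstruction: content and style both from x1.
x1_reconstructed = decode(c1, s1)
```

The translated image keeps the spatial pattern of `x1` but takes on the overall brightness of `x2`, while decoding with the image's own style code returns the original image.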

The loss function is composed of two factors: the bidirectional reconstruction loss and the adversarial loss. The bidirectional reconstruction loss ensures two-way reconstruction in the directions image $\rightarrow$ latent $\rightarrow$ image and latent $\rightarrow$ image $\rightarrow$ latent. The image reconstruction loss is the difference between the image $x_{1}$ and the image reconstructed from its latent codes $c_{1}$ and $s_{1}$; it is given below ($\mathcal{L}_{recon}^{x_{2}}$ is defined similarly for the image $x_{2}$):

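$$\mathcal{L}_{recon}^{x_{1}}=\mathbb{E}_{x_{1}\sim p(x_{1})}\left[\lVert G_{1}(E_{1}(x_{1}))-x_{1}\rVert_{1}\right] \tag{4}$$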

The latent reconstruction loss $\mathcal{L}_{recon}^{c_{1}}$ is the difference between the content encoding of the generated image $G_{2}(c_{1},s_{2})$ and the content encoding $c_{1}$ of the image $x_{1}$. Likewise, $\mathcal{L}_{recon}^{s_{2}}$ is the difference between the style encoding of the generated image $G_{2}(c_{1},s_{2})$ and the style encoding $s_{2}$ of the image $x_{2}$. They are given by the following equations ($\mathcal{L}_{recon}^{c_{2}}$ and $\mathcal{L}_{recon}^{s_{1}}$ are defined similarly):

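$$\mathcal{L}_{recon}^{c_{1}}=\mathbb{E}_{c_{1}\sim p(c_{1}),\,s_{2}\sim p(s_{2})}\left[\lVert E_{2}^{c}(G_{2}(c_{1},s_{2}))-c_{1}\rVert_{1}\right] \tag{5}$$

$$\mathcal{L}_{recon}^{s_{2}}=\mathbb{E}_{c_{1}\sim p(c_{1}),\,s_{2}\sim p(s_{2})}\left[\lVert E_{2}^{s}(G_{2}(c_{1},s_{2}))-s_{2}\rVert_{1}\right] \tag{6}$$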

Here $q(s_{2})$ is defined as the prior $\mathcal{N}(0, I)$ and $p(c_{1})$ is defined by $c_{1}=E_{1}^{c}(x_{1})$ where $x_{1}\sim p(x_{1})$.
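Each reconstruction term is an L1 distance between a quantity and its round-trip reconstruction. The sketch below shows this in plain Python for a single sample, which stands in for the expectation. All numbers are made-up placeholders for what the encoders and decoders would return, and `l1_distance` is a hypothetical helper name.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two equally sized vectors."""
    return sum(abs(p - q) for p, q in zip(a, b))


# Image -> latent -> image: x1 versus G_1(E_1(x1)) (placeholder values).
x1 = [0.25, 0.5, 0.75]
x1_round_trip = [0.5, 0.5, 0.5]
loss_recon_x1 = l1_distance(x1_round_trip, x1)

# Latent -> image -> latent: c1 versus E_2^c(G_2(c1, s2)) (placeholder values).
c1 = [1.0, -1.0]
c1_round_trip = [0.5, -1.0]
loss_recon_c1 = l1_distance(c1_round_trip, c1)

# Latent -> image -> latent: s2 versus E_2^s(G_2(c1, s2)) (placeholder values).
s2 = [0.0]
s2_round_trip = [0.25]
loss_recon_s2 = l1_distance(s2_round_trip, s2)
```

All three terms fall to zero only when the round trip reproduces its input exactly, which is the two-way consistency the bidirectional reconstruction loss is meant to enforce.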

Since we use a GAN framework, we also have an adversarial loss, which is minimised so that the generated images are as close as possible to the original images. This loss is given by:

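$$\mathcal{L}_{GAN}^{x_{2}}=\mathbb{E}_{c_{1}\sim p(c_{1}),\,s_{2}\sim p(s_{2})}\left[\log\left(1-D_{2}(G_{2}(c_{1},s_{2}))\right)\right]+\mathbb{E}_{x_{2}\sim p(x_{2})}\left[\log D_{2}(x_{2})\right] \tag{7}$$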

Here $D_{2}$ is the discriminator function that distinguishes between the real image $x_{2}$ and the translated images. The discriminator function $D_{1}$ and the loss $\mathcal{L}_{GAN}^{x_{1}}$ are defined in a similar way.
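The log-form adversarial objective can be evaluated directly once the discriminator outputs are known. The following plain-Python sketch uses a single real and a single translated sample in place of the expectations. The probabilities are made-up values, and `gan_loss` is a hypothetical helper name.

```python
import math


def gan_loss(d_real, d_fake):
    """log(1 - D(fake)) + log D(real) for discriminator outputs in (0, 1)."""
    return math.log(1.0 - d_fake) + math.log(d_real)


# An undecided discriminator assigns 0.5 to both real and translated images.
undecided = gan_loss(d_real=0.5, d_fake=0.5)

# A confident discriminator scores the real image high and the fake low.
confident = gan_loss(d_real=0.9, d_fake=0.1)
```

The discriminator tries to raise this value (the confident case is larger), while the generator tries to lower it by making translated images harder to tell apart from real ones.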

As mentioned earlier, the architecture we use is essentially the same as the one proposed by Huang et al. [3]. The only differences are that we use only one style image and add the affine transform loss to the overall loss function. Following [2], we assume that the input images are photorealistic and that this property should not be lost. We therefore penalize the loss function with the photorealism factor so that this property is retained while minimizing the reconstruction losses from the image, content and style latent spaces. Our overall loss function is given by:

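$$\begin{aligned}\min_{E_{1},E_{2},G_{1},G_{2}}\ \max_{D_{1},D_{2}}\ \mathcal{L}(E_{1},E_{2},G_{1},G_{2},D_{1},D_{2})={}&\mathcal{L}_{GAN}^{x_{1}}+\mathcal{L}_{GAN}^{x_{2}}+\lambda_{x}(\mathcal{L}_{recon}^{x_{1}}+\mathcal{L}_{recon}^{x_{2}})\\&+\lambda_{c}(\mathcal{L}_{recon}^{c_{1}}+\mathcal{L}_{recon}^{c_{2}})+\lambda_{s}(\mathcal{L}_{recon}^{s_{1}}+\mathcal{L}_{recon}^{s_{2}})\\&+\lambda_{A}(\mathcal{L}_{m}^{x_{1}}+\mathcal{L}_{m}^{x_{2}})\end{aligned} \tag{8}$$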

Where $\lambda_{x}$ , $\lambda_{c}$ , $\lambda_{s}$ are the weights that control the reconstruction, and $\lambda_{A}$ is the photorealism regularization weight [2].
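The overall objective is a weighted sum of the individual terms, which can be sketched as a small plain-Python function. The default weights are the values reported in the implementation details ($\lambda_{x}=10$, $\lambda_{c}=1$, $\lambda_{s}=1$, $\lambda_{A}=10^{4}$). The function name `total_loss` and the input numbers are illustrative assumptions, not taken from the released code.

```python
def total_loss(gan, recon_x, recon_c, recon_s, affine,
               lam_x=10.0, lam_c=1.0, lam_s=1.0, lam_a=1e4):
    """Weighted sum of loss terms; each argument is a pair (domain 1, domain 2)."""
    return (
        sum(gan)
        + lam_x * sum(recon_x)
        + lam_c * sum(recon_c)
        + lam_s * sum(recon_s)
        + lam_a * sum(affine)
    )


# Placeholder per-domain loss values.
without_affine = total_loss(
    gan=(0.0, 0.0),
    recon_x=(0.5, 0.5),
    recon_c=(1.0, 1.0),
    recon_s=(0.25, 0.25),
    affine=(0.0, 0.0),
)
with_affine = total_loss(
    gan=(0.0, 0.0),
    recon_x=(0.5, 0.5),
    recon_c=(1.0, 1.0),
    recon_s=(0.25, 0.25),
    affine=(1.0, 1.0),
)
```

With $\lambda_{A}=10^{4}$, even a small affine term dominates the sum, so the relative scale of the photorealism factor matters a great deal in practice.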

III-D Analysis

Our goal is to minimize the loss function defined in equation (8). This minimum is the optimal state of our model, at which the following conditions hold:

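$$p(c_{1})=p(c_{2}) \tag{9}$$

$$p(s_{1})=q(s_{1}) \tag{10}$$

$$p(s_{2})=q(s_{2}) \tag{11}$$

$$p(x_{1},x_{1\longrightarrow 2})+p(c_{1})=p(x_{2\longrightarrow 1},x_{2})+p(c_{2}) \tag{12}$$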

Equation (12) differs from the one proposed by Huang et al. [3] because our model adds the local affine loss of the content images. Our model is constructed so that when $x_{1}$ is the content image, $x_{2}$ is taken as the style image, and vice versa. This is why the local affine loss of both images is taken into consideration in equation (8), and why the content marginal distributions are added when comparing the joint distributions $p(x_{1},x_{1\longrightarrow 2})$ and $p(x_{2\longrightarrow 1},x_{2})$. At this state the content marginal distributions $p(c_{1})$ and $p(c_{2})$ also become equal, and the style marginal distributions $p(s_{i})$ are equal to their prior distributions $q(s_{i})$. The fact that we use a one-to-one image mapping makes our process sound as if it follows the supervised learning paradigm, but it does not. Even though we have only one image each in the content and style domains, the images are encoded into content and style latent spaces and translated on the basis of conditional probability. Our method is therefore free from the deterministic translations performed by the methods [1, 2, 7, 12, 14], which use image segmentation to map regions of interest between the content and style images.

IV Implementation Details

We have adapted the publicly available PyTorch implementation of Multimodal Unsupervised Image-to-Image Translation [3]. The architecture consists of an auto-encoder (generator) and a discriminator. The auto-encoder comprises separate content and style encoders and a combined decoder, and consists of the following layers:

• The content encoder, whose output makes up the content latent space (in the listed order):

  – 7 $\times$ 7 convolutional block with stride 1 and 64 filters.
  – 4 $\times$ 4 convolutional block with stride 2 and 128 filters.
  – 4 $\times$ 4 convolutional block with stride 2 and 256 filters.
  – 4 residual blocks, each consisting of two 3 $\times$ 3 convolutional blocks with 256 filters.

• The style encoder, whose output is added to the style latent space (in the listed order):

  – 7 $\times$ 7 convolutional block with stride 1 and 64 filters.
  – 4 $\times$ 4 convolutional block with stride 2 and 128 filters.
  – Three 4 $\times$ 4 convolutional blocks with stride 2 and 256 filters.
  – Global average pooling layer.
  – Fully connected layer with 8 filters.

• The decoder, which reconstructs an image from the content and style latent codes (in the listed order):

  – 4 residual blocks, each consisting of two 3 $\times$ 3 convolutional blocks with 256 filters.
  – 2 $\times$ 2 nearest-neighbour upsampling layer followed by a 5 $\times$ 5 convolutional layer with stride 1 and 128 filters.
  – 2 $\times$ 2 nearest-neighbour upsampling layer followed by a 5 $\times$ 5 convolutional layer with stride 1 and 64 filters.
  – 7 $\times$ 7 convolutional block with stride 1 and 3 filters.
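The spatial sizes implied by the layer lists above can be checked with the standard convolution output-size formula. The plain-Python sketch below assumes "same"-style padding (3 for 7 $\times$ 7, 2 for 5 $\times$ 5, 1 for 3 $\times$ 3 and 4 $\times$ 4 kernels). The paper does not state the padding, so these values and the helper names are assumptions.

```python
def conv_out(size, kernel, stride, padding):
    """Output width of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def content_encoder_shape(size=256):
    """(channels, height, width) of the content code for a square input."""
    size = conv_out(size, 7, 1, 3)      # 7x7, stride 1, 64 filters
    size = conv_out(size, 4, 2, 1)      # 4x4, stride 2, 128 filters
    size = conv_out(size, 4, 2, 1)      # 4x4, stride 2, 256 filters
    for _ in range(4 * 2):              # 4 residual blocks, two 3x3 convs each
        size = conv_out(size, 3, 1, 1)
    return (256, size, size)


def decoder_shape(size):
    """(channels, height, width) of the decoded image for a square content code."""
    for _ in range(4 * 2):              # 4 residual blocks, two 3x3 convs each
        size = conv_out(size, 3, 1, 1)
    size = conv_out(size * 2, 5, 1, 2)  # 2x upsampling, then 5x5 with 128 filters
    size = conv_out(size * 2, 5, 1, 2)  # 2x upsampling, then 5x5 with 64 filters
    size = conv_out(size, 7, 1, 3)      # 7x7, stride 1, 3 filters
    return (3, size, size)


content_shape = content_encoder_shape(256)
image_shape = decoder_shape(content_shape[1])
```

Under these padding assumptions, a 256 $\times$ 256 input is reduced to a 64 $\times$ 64 content code with 256 channels, and the decoder restores a three-channel 256 $\times$ 256 image.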

The discriminator is the multi-scale discriminator proposed by Wang et al. [9], used with the LSGAN objective function proposed by Mao et al. [10]. This helps steer the generator towards producing realistic images and performing effective translation while preserving the content. The architecture consists of the following layers in the listed order:

• 4 $\times$ 4 convolutional block with stride 2 and 64 filters.

• 4 $\times$ 4 convolutional block with stride 2 and 128 filters.

• 4 $\times$ 4 convolutional block with stride 2 and 256 filters.

• 4 $\times$ 4 convolutional block with stride 2 and 512 filters.
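As a rough illustration of the least-squares (LSGAN) objective, the sketch below uses the common labelling with target 1 for real and 0 for translated images. This labelling, the omission of constant scaling factors, and the function names are assumptions, and the scores are made-up discriminator outputs. With a multi-scale discriminator, such a loss would typically be summed over the scales.

```python
def lsgan_discriminator_loss(real_scores, fake_scores):
    """Mean squared error pushing real scores to 1 and fake scores to 0."""
    real_term = sum((r - 1.0) ** 2 for r in real_scores) / len(real_scores)
    fake_term = sum(f ** 2 for f in fake_scores) / len(fake_scores)
    return real_term + fake_term


def lsgan_generator_loss(fake_scores):
    """Mean squared error pushing the scores of generated images to 1."""
    return sum((f - 1.0) ** 2 for f in fake_scores) / len(fake_scores)


# A discriminator that separates real from fake perfectly.
perfect_d = lsgan_discriminator_loss([1.0, 1.0], [0.0, 0.0])
# The generator is penalised fully in that case ...
fooled_none = lsgan_generator_loss([0.0, 0.0])
# ... and less once its images score 0.5.
fooled_half = lsgan_generator_loss([0.5, 0.5])
```

Unlike the log-form loss, the squared error still changes for samples the discriminator classifies confidently, which is the usual motivation given for the least-squares variant.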

We use the Python implementation for computing the Matting Laplacian matrix [16] from the TensorFlow implementation of Deep Photo Style Transfer [2]. The image, content and style reconstruction weights and the photorealism regularization weight are experimentally set to $\lambda_{x}=10$, $\lambda_{c}=1$, $\lambda_{s}=1$ and $\lambda_{A}=10^{4}$ [2] respectively. Our implementation is available on GitHub: https://github.com/ozamanan/semisit

V Results

Our dataset is composed of only two three-channel images with a resolution of 256 $\times$ 256, so we use a batch size of 1. In every iteration, both images from the dataset are used once to train the respective parts of the network: when one image is used as the content image, the other is used as the style image and vice versa, which completes the bidirectional reconstruction process. All images used in the experiments are taken from the implementation of Deep Photo Style Transfer [2].
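The alternating use of the two images can be written down as a simple schedule. The plain-Python sketch below only enumerates which image plays which role in each iteration. It assumes one update per direction per iteration, and the name `role_schedule` and the labels are hypothetical.

```python
def role_schedule(num_iterations):
    """List (iteration, content image, style image) for the two-image dataset."""
    steps = []
    for iteration in range(num_iterations):
        steps.append((iteration, "x1", "x2"))  # x1 as content, x2 as style
        steps.append((iteration, "x2", "x1"))  # roles swapped
    return steps


schedule = role_schedule(2)
```

Each image appears once as content and once as style in every iteration, so both translation directions are trained on the same pair.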

The images 4e, 4f, 4k, 4l, 4q and 4r in fig. 4 are the results generated by our code, whereas the images 4c, 4d, 4i, 4j, 4o and 4p are the results generated using the code of Huang et al. [3]. The results we show are the optimal results, beyond which the images tend to converge to their respective style images. The optimal solution is the one where the resulting image holds the properties of both the content and style images while still being recognised by the discriminator as a constituent image of the dataset. This optimal state is described by eq. (12).

In the visual comparison of fig. 4, this optimal state shows an improvement in content preservation and image smoothness over the proposal of Huang et al. [3], which we attribute to the addition of the affine transform factors $\mathcal{L}_{m}^{x_{1}}$ and $\mathcal{L}_{m}^{x_{2}}$. Thus our proposed method generates better results than the method of Huang et al. [3].

VI Conclusion

We have proposed an architecture that performs semi-supervised image-to-image translation and, in our experiments, preserves content and image smoothness better than the method of Huang et al. [3]. Future work includes reducing noise and making the results more accurate even at low resolutions. Another direction is to extend this architecture to the generation of music, text and videos.

Bibliography

[1] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
[2] F. Luan, S. Paris, E. Shechtman, and K. Bala. Deep photo style transfer. In CVPR, 2017.
[3] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.
[4] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.
[5] A. Almahairi, S. Rajeswar, A. Sordoni, P. Bachman, and A. Courville. Augmented CycleGAN: Learning many-to-many mappings from unpaired data. arXiv preprint arXiv:1802.10151, 2018.
[6] A. Levin, D. Lischinski, and Y. Weiss. A closed-form solution to natural image matting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):228–242, 2008.
[7] Y. Li, M.-Y. Liu, X. Li, M.-H. Yang, and J. Kautz. A closed-form solution to photorealistic image stylization. arXiv:1802.06474, 2018.
[8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.