TL;DR
SMIT introduces a unified framework for multi-label, multimodal image translation that handles unpaired datasets, multiple attributes, and style diversity using a single generator and domain embeddings.
Contribution
It presents a novel joint approach combining diversity and multi-mapping in image translation with a single generator and domain embeddings, addressing multiple challenges simultaneously.
Findings
Outperforms state-of-the-art in multi-label and multimodal translation tasks.
Effectively handles continuous style and label interpolation.
Generalizes well across different datasets and scenarios.
| CycleGAN | BiCycleGAN | StarGAN | MUNIT&alike | DRIT | GANimation | SMIT | |
| [55] | [56] | [12] | [23, 3, 39] | [34] | [46] | (ours) | |
| Unpaired Training | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Multimodal Generation | ✓ | ✓ | ✓ | ✓ | |||
| Multiple Attributes | ✓ | ✓ | ✓ | ||||
| One Single Generator | ✓ | ✓ | ✓ | ||||
| Fine-grained Transformation | ✓ | ✓ | ✓ | ||||
| Continuous Label Interpolation | ✓ | ✓ | |||||
| Style Transformation | ✓ | ✓ | ✓ | ||||
| Style Interpolation | ✓ | ✓ | ✓ | ||||
| Attention Mechanism | ✓ | ✓ |
| | Yosemite [26] | |
| | D | PD |
| SMIT_no_style | 0.412 ± 0.046 | - |
| SMIT_DE_learning | 0.413 ± 0.044 | 0.004 ± 0.003 |
| SMIT_no_attention | 0.406 ± 0.041 | 0.105 ± 0.071 |
| SMIT_style_encoder | 0.418 ± 0.043 | 0.133 ± 0.063 |
| SMIT | 0.419 ± 0.048 | 0.145 ± 0.072 |

| | Edges2Shoes [26] | | Edges2Handbags [26] | | Yosemite [26] | | # Parameters |
| | D | PD | D | PD | D | PD | (Generator) |
| CycleGAN [55] | 0.272 ± 0.048 | - | 0.293 ± 0.081 | - | 0.272 ± 0.048 | - | 2x11.4M |
| DRIT [34] | 0.237 ± 0.149 | 0.028 ± 0.030 | 0.296 ± 0.181 | 0.056 ± 0.060 | 0.398 ± 0.038 | 0.126 ± 0.019 | 2x21.3M |
| MUNIT [23] | 0.295 ± 0.051 | 0.077 ± 0.057 | 0.365 ± 0.052 | 0.123 ± 0.067 | 0.335 ± 0.045 | 0.208 ± 0.034 | 2x15.0M |
| SMIT (ours) | 0.303 ± 0.058 | 0.072 ± 0.056 | 0.367 ± 0.048 | 0.096 ± 0.072 | 0.437 ± 0.041 | 0.145 ± 0.072 | 8.4M |
| Real Data | 0.313 ± 0.052 | - | 0.374 ± 0.051 | - | 0.447 ± 0.049 | - | - |
| Part | Input → Output Shape | Layer Information |
| Down-sampling | | Conv2d(dim=32, kernel=7, stride=1, padding=3), IN, ReLU |
| | | Conv2d(64, 4, 2, 1), IN, ReLU |
| | | Conv2d(128, 4, 2, 1), IN, ReLU |
| | | Conv2d(256, 4, 2, 1), IN, ReLU |
| Bottleneck | | Residual Block: Conv2d(256, 3, 1, 1), AdaIN, ReLU |
| | | Residual Block: Conv2d(256, 3, 1, 1), AdaIN, ReLU |
| | | Residual Block: Conv2d(256, 3, 1, 1), AdaIN, ReLU |
| | | Residual Block: Conv2d(256, 3, 1, 1), AdaIN, ReLU |
| | | Residual Block: Conv2d(256, 3, 1, 1), AdaIN, ReLU |
| | | Residual Block: Conv2d(256, 3, 1, 1), AdaIN, ReLU |
| Up-sampling | | Nearest Upsampling (2x), Conv2d(128, 3, 1, 1), LN, ReLU |
| | | Nearest Upsampling (2x), Conv2d(64, 3, 1, 1), LN, ReLU |
| | | Nearest Upsampling (2x), Conv2d(32, 3, 1, 1), LN, ReLU |
| Fake Output | | Conv2d(3, 7, 1, 3), None, Tanh |
| Attention mask | | Conv2d(1, 7, 1, 3), None, Sigmoid |

| Layer | Input → Output Shape | Layer Information |
| Embedding Projection | | FullyConnected(dim=) |

| Layer | Input → Output Shape | Layer Information |
| Input Layer | | Conv2d(dim=32, kernel=4, stride=2, padding=1), SN, LReLU |
| Hidden Layer | | Conv2d(64, 4, 2, 1), SN, LReLU |
| Hidden Layer | | Conv2d(128, 4, 2, 1), SN, LReLU |
| Hidden Layer | | Conv2d(256, 4, 2, 1), SN, LReLU |
| Hidden Layer | | Conv2d(512, 4, 2, 1), SN, LReLU |
| Hidden Layer | | Conv2d(1024, 4, 2, 1), SN, LReLU |
| Hidden Layer | | Conv2d(2048, 4, 2, 1), SN, LReLU |
| Output Layer (src) | | Conv2d(1, 3, 1, 1) |
| Output Layer (cls) | | Conv2d() |
| | Edges2Shoes | | | |
| | Edges | | Shoes | |
| | D | PD | D | PD |
| CycleGAN | 0.269 ± 0.046 | - | 0.275 ± 0.050 | - |
| DRIT | 0.000 ± 0.000 | 0.000 ± 0.000 | 0.243 ± 0.052 | 0.056 ± 0.017 |
| MUNIT | 0.269 ± 0.049 | 0.027 ± 0.005 | 0.263 ± 0.049 | 0.126 ± 0.039 |
| SMIT (ours) | 0.274 ± 0.046 | 0.020 ± 0.006 | 0.261 ± 0.060 | 0.123 ± 0.029 |
| Real Data | 0.274 ± 0.046 | - | 0.293 ± 0.051 | - |

| | Edges2Handbags | | | |
| | Edges | | Handbags | |
| | D | PD | D | PD |
| CycleGAN | 0.225 ± 0.043 | - | 0.361 ± 0.045 | - |
| DRIT | 0.000 ± 0.000 | 0.000 ± 0.000 | 0.344 ± 0.061 | 0.112 ± 0.032 |
| MUNIT | 0.352 ± 0.045 | 0.063 ± 0.016 | 0.334 ± 0.052 | 0.183 ± 0.039 |
| SMIT (ours) | 0.373 ± 0.041 | 0.029 ± 0.010 | 0.346 ± 0.048 | 0.164 ± 0.035 |
| Real Data | 0.346 ± 0.045 | - | 0.370 ± 0.053 | - |

| | Edges2Objects | | | | | | | |
| | Edges Shoes | | Shoes | | Edges Handbags | | Handbags | |
| | D | PD | D | PD | D | PD | D | PD |
| CycleGAN | - | - | - | - | - | - | - | - |
| DRIT | - | - | - | - | - | - | - | - |
| MUNIT | - | - | - | - | - | - | - | - |
| SMIT (ours) | 0.130 ± 0.104 | 0.055 ± 0.024 | 0.286 ± 0.07 | 0.168 ± 0.028 | 0.279 ± 0.045 | 0.012 ± 0.008 | 0.304 ± 0.052 | 0.233 ± 0.060 |
| Real Data | 0.274 ± 0.046 | - | 0.293 ± 0.051 | - | 0.346 ± 0.045 | - | 0.370 ± 0.053 | - |

| | Yosemite | | | |
| | Summer | | Winter | |
| | D | PD | D | PD |
| CycleGAN | 0.408 ± 0.037 | - | 0.406 ± 0.041 | - |
| DRIT | 0.405 ± 0.033 | 0.120 ± 0.018 | 0.395 ± 0.040 | 0.131 ± 0.020 |
| MUNIT | 0.372 ± 0.034 | 0.212 ± 0.029 | 0.313 ± 0.035 | 0.204 ± 0.037 |
| SMIT (ours) | 0.378 ± 0.048 | 0.167 ± 0.070 | 0.410 ± 0.049 | 0.129 ± 0.069 |
| Real Data | 0.444 ± 0.055 | - | 0.444 ± 0.040 | - |
| | RafD | | | | | | | |
| | Conditional Inception Score (CIS) | | | | | | | |
| | Neutral | Anger | Contempt | Disgust | Fear | Happy | Sad | Surprise |
| StarGAN | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| GANimation | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| SMIT (ours) | 1.201 | 1.187 | 1.197 | 1.237 | 1.329 | 1.373 | 1.249 | 1.201 |

| | RafD | | | | | | | |
| | Inception Score (IS) | | | | | | | |
| | Neutral | Anger | Contempt | Disgust | Fear | Happy | Sad | Surprise |
| StarGAN | 2.039 | 1.407 | 2.194 | 1.081 | 1.748 | 1.483 | 2.060 | 1.275 |
| GANimation | 1.559 | 1.320 | 2.024 | 1.115 | 1.427 | 1.698 | 1.888 | 1.033 |
| SMIT (ours) | 3.502 | 2.246 | 3.441 | 1.598 | 2.451 | 2.327 | 3.009 | 1.527 |
| Real Data | 1.120 | 1.439 | 1.401 | 1.001 | 1.360 | 1.001 | 1.126 | 1.007 |

| | RafD | | | | | | | |
| | Diversity (D) | | | | | | | |
| | Neutral | Anger | Contempt | Disgust | Fear | Happy | Sad | Surprise |
| StarGAN | 0.157 | 0.154 | 0.152 | 0.152 | 0.152 | 0.150 | 0.149 | 0.150 |
| GANimation | 0.156 | 0.156 | 0.154 | 0.156 | 0.156 | 0.157 | 0.159 | 0.160 |
| SMIT (ours) | 0.164 | 0.161 | 0.162 | 0.163 | 0.163 | 0.164 | 0.165 | 0.170 |
| Real Data | 0.167 | 0.165 | 0.166 | 0.166 | 0.166 | 0.167 | 0.167 | 0.167 |

| | RafD | | | | | | | |
| | Partial Diversity (PD) | | | | | | | |
| | Neutral | Anger | Contempt | Disgust | Fear | Happy | Sad | Surprise |
| StarGAN | - | - | - | - | - | - | - | - |
| GANimation | - | - | - | - | - | - | - | - |
| SMIT (ours) | 0.003 | 0.004 | 0.003 | 0.004 | 0.004 | 0.004 | 0.003 | 0.005 |
SMIT: Stochastic Multi-Label Image-to-Image Translation
Andrés Romero
BCV Lab
Universidad de Los Andes
Pablo Arbeláez
BCV Lab
Universidad de Los Andes
Luc Van Gool
ETH Zürich
KU Leuven
Radu Timofte
CV Lab
ETH Zürich
Abstract
Cross-domain mapping has been a very active topic in recent years. Given one image, its main purpose is to translate it to the desired target domain, or to multiple domains in the case of multiple labels. This problem is highly challenging for three main reasons: (i) unpaired datasets, (ii) multiple attributes, and (iii) the multimodality (e.g., style) associated with the translation. Most existing state-of-the-art methods address only two of these, i.e., either (i) and (ii), or (i) and (iii). In this work, we propose a joint framework (i, ii, iii) of diversity and multi-mapping image-to-image translation, using a single generator to conditionally produce countless and unique fake images that hold the underlying characteristics of the source image. Our system does not use style regularization; instead, it uses an embedding representation that we call domain embedding for both domain and style. Extensive experiments over different datasets demonstrate the effectiveness of our proposed approach in comparison with the state-of-the-art in both multi-label and multimodal problems. Additionally, our method is able to generalize under different scenarios: continuous style interpolation, continuous label interpolation, and fine-grained mapping. Code and pretrained models are available at https://github.com/BCV-Uniandes/SMIT.
1 Introduction
The human ability to easily imagine what a black-haired person would look like if they were blond or wore a different type of eyeglasses, or to imagine a winter scene as summer, is formulated as the image-to-image (I2I) translation problem in the computer vision community. Since the introduction of Generative Adversarial Networks (GANs) [19], a plethora of problems such as video analysis [51, 7], super resolution [33, 9], semantic synthesis [26, 10], photo enhancement [24, 25], photo editing [49, 14], and most recently domain adaptation [21, 43] have been addressed as I2I translation problems.
Initially, translating from one domain into another required paired datasets that exactly matched both domains [26], e.g., the edges↔shoes or edges↔handbags datasets. However, this approach is impractical because the full representation of the cross-domain mapping is, in most cases, intractable. Existing techniques try to perform deterministic I2I translation with unpaired images to map from one domain into another (one-to-one) [55, 4, 37, 25], or into multiple domains (one-to-many) [12, 46, 20]. Nevertheless, many problems are fundamentally stochastic, as there are countless mappings from one domain to another, e.g., a day↔night or cat↔dog translation.
Recent techniques [34, 23, 39] have successfully addressed the multimodal representation for one-to-one domain translation. These methods build on an observation from traditional I2I approaches [55, 56]: the generator tends to ignore injected noise. As a consequence, these techniques cast disentangled representation as a style transfer problem, with a shared content space representation and a style encoder network.
In this paper, we propose Stochastic Multi-Label Image-to-Image Translation (SMIT), a novel and robust framework that includes multiple labels and diversity, and does not require either style or content regularization. Moreover, we build our entire approach using a single generator that does not ignore the noise perturbation, i.e. for different levels of noise our method produces different styles with the underlying characteristics and structure of the target domain. (Hereafter, we refer to domains as the number of labels per dataset, and to style as the diversity induced by noise.) As illustrated in Figure 1, SMIT learns a full distribution for each attribute, so it can perform diverse translation for different fine-grained or broader attributes. It is important to remark that, in contrast to [12, 46, 30], the trainable parameters in the SMIT generator are not label-dependent; that is, there is a negligible difference in either computational time or memory consumption when learning as many as 40 attributes instead of just 2 labels. Figure 2 presents an overview of our model. We radically depart from mainstream approaches [12, 46, 30], where the target domain is inserted through spatial concatenation. Instead, we indirectly inject the style and the target labels through Adaptive Instance Normalization (AdaIN) [22] layers in the generator, and the discriminator aims at recovering only the labels, i.e. we remark the importance of no style regularization.
We perform a comprehensive quantitative evaluation of SMIT either for disentanglement or multiple domain I2I problems, demonstrating the advantages of our method in comparison with existing state-of-the-art models. We also show qualitative results on several datasets that validate the effectiveness of our approach under varied and challenging settings.
More precisely, our main contribution is a single, end-to-end system with a domain-agnostic generator capable of performing style transformation, multi-label mapping, style interpolation, and continuous label interpolation with no need for style regularization. For reproducibility, we plan to release our source code and trained models.
2 Related Work
Generative Adversarial Networks (GANs) [19] have proven to be a powerful approach to learn statistical data distributions. GANs rely on game theory where there are two networks (discriminator and generator) optimizing a Minimax function, a training scheme also known as adversarial training. The discriminator learns to distinguish real images from fake ones produced by the generator, and the generator learns to fool the discriminator by producing realistic fake images. Since their introduction, GANs have provided remarkable results in several computer vision problems, such as image generation [47, 11, 29], image translation [26, 55, 3, 37], video translation [51, 7] and resolution enhancement [6, 33, 2]. As our approach lies in the domain of image-to-image translation, it is the focus of our related work review.
Conditional GANs (cGANs)
In vanilla GANs [19], the information regarding the domain is unknown. Conversely, in conditional GANs (cGANs) [44], the discriminator not only distinguishes between real and fake, but also trains an auxiliary classifier for the conditional data distribution. cGANs have been applied in image-to-image translation problems for semantic layouts [26, 10], super resolution [33], photo editing [49], and for multi-target domains [12, 30, 46]. While traditional cGANs exploit the underlying conditional distribution of the data, they are constrained to produce deterministic outputs, i.e. given an input and a target label, the output is always the same. In comparison, our approach introduces style randomness in the generation process.
Image-to-Image Translation (I2I)
Isola et al. [26] introduced a framework in which they trained cGANs using paired datasets. This work led to a new set of previously unexplored I2I problems. Based on these findings, Zhu et al. [55] extended the framework by introducing the cycle-consistency loss, which allows cross-domain mapping with unpaired datasets. Although CycleGAN [55] is currently one of the most common backbones for I2I models and frameworks, it is constrained to one-to-one domain translation, hence it needs one generator per domain. In contrast, our method uses a single generator regardless of the number of domains.
Other works [12, 46] extended the cycle-consistency insight to cope with multiple domains using a single generator. These methods feed the label as independent features to the first layer of the generator, hence constraining the generator weights to restricted applications. Similarly, additional methods [30, 20] tackled the multi-label mapping problem from a VAE-GAN [32] perspective. Our approach neither uses a variational autoencoder representation nor depends on label weights, since the generator always has the same number of parameters regardless of the application.
Disentangled Representations
A recurrent limitation in traditional I2I methods is their deterministic output. In image generation problems [47, 11, 28], disentangled representations are achieved by injecting random noise in the generator. Nevertheless, this idea cannot be used on the seminal CycleGAN, as this framework learns to ignore the noise vector due to the lack of regularization [55].
Recently, there have been efforts [10, 56, 8] to produce diverse representations from a single input. For instance, BiCycleGAN [56] bypassed the regularization issues of CycleGAN by including a random noise vector in the training scheme, thus generating images of higher quality than CycleGAN. However, this approach requires paired data to train, which makes it infeasible to scale in real-world scenarios.
Furthermore, generating multimodal images can also be studied as a problem of style transfer [17, 18] between two images. Inspired by the work of Gatys et al. [17], recent approaches [23, 39, 34] split the generator encoder into a two-stream content and style encoder, where the content stream extracts the underlying structure, shape and main information to be preserved in the image, and the style stream captures the rendering attributes to be transferred. These disentangled representations are similar in spirit to the CycleGAN cycle-consistency adversarial loss, since they perform a cross-domain mapping for the style and content space. Consequently, it is difficult to perform fine-grained translations. In comparison, our proposed approach does not suffer in this regard, since we constrain neither the content nor the style distributions. Moreover, as the experiments will show, SMIT is suitable for both coarser translations and subtle local appearances, e.g., art in-painting or facial expressions, respectively.
Continuous Interpolation
On the one hand, Pumarola et al. [46] introduced a cGAN framework that takes as input continuous rather than discrete labels. This approach enables the generation of examples with continuous labels at inference time; however, it does not handle diversity for the same input. On the other hand, for binary problems, Lee et al. [34] and Huang et al. [23] performed continuous interpolation between two styles in order to produce a pseudo-animated style transfer with images that belong to the same domain. Our work uses both target and style continuous interpolation.
Table 1 summarizes our main differences with respect to the literature for either multi-label or multimodal translation. SMIT has richer capabilities than those of existing methods, as we perform fine-grained local transformation, style transformation, continuous style interpolation, continuous label interpolation, and multi-label transfer using one single generator.
3 Stochastic Multi-Label Image-to-Image Translation (SMIT)
Our final goal is to generate multi-attribute images with different styles using a single generator. As illustrated in Figure 2, our method is an ensemble of three different networks: a generator, a discriminator, and a domain embedding (DE). The generator takes the source image as input and translates it. The discriminator does not only differentiate between real and fake samples, but it also approximates the output distribution of the real target by means of an auxiliary classifier. Finally, SMIT uses the DE to merge both target style and target labels into the generator.
3.1 Problem Formulation
Let $X_r$ be the real image. $X_r$ is encoded by a set of discrete or continuous labels $y_r$. Additionally, for each possible $y_r$, there is an unknown style distribution $s_r$. Given a target label $y_f$ and a target style $s_f$, we want to learn a mapping function $G$ to produce a fake image $X_f$, without having access to the joint distribution $p(X_r, X_f)$:
$$G(X_r, y_f, s_f) \rightarrow X_f \tag{1}$$
As is common in cGANs [12, 46, 11, 47], we have a discriminator $D$ that outputs the source domain probability, i.e. true or fake, and a classification/regression estimator, namely $D_{src}$ and $D_{cls}$.
3.2 Model
Generator ()
We build upon the CycleGAN generator [55]. It follows an encoder-decoder architecture, which consists of down-sampling layers, residual blocks, and up-sampling layers. Importantly, we use Instance Normalization (IN) [15, 52], Adaptive Instance Normalization (AdaIN) [22], and Layer Normalization (LN) [5] for the three stages, respectively. We use IN only during the first stage and not in the up-sampling stage because IN would introduce undesirable changes to the global mean and variance that AdaIN modifies in the residual layers.
Domain Embedding (DE)
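We indirectly input the target attribute and the style randomness through AdaIN [22] weights. AdaIN normalization is computed from Equation 2, where $x$ is the input and $\gamma$, $\beta$ are the adaptive parameters.
$$\mathrm{AdaIN}(x, \gamma, \beta) = \gamma \left( \frac{x - \mu(x)}{\sigma(x)} \right) + \beta \tag{2}$$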
As the AdaIN parameters depend entirely on the number of feature maps of the input , they are agnostic to both style and label domains, which makes the generator entirely label and style independent. This key property makes SMIT highly suitable for transfer learning, addressing a drawback of cGANs in real-world scenarios.
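As a minimal sketch of the AdaIN operation described above, the NumPy snippet below normalizes each feature map and then applies a per-channel scale and bias. The function name `adain`, the `eps` term, and the tensor sizes are illustrative assumptions, not taken from the SMIT code base.

```python
import numpy as np

def adain(features, scale, bias, eps=1e-5):
    """Adaptive Instance Normalization for one sample.

    features: array of shape (channels, height, width)
    scale, bias: arrays of shape (channels,), supplied externally
    eps: small constant for numerical stability (an assumption of this sketch)
    """
    mean = features.mean(axis=(1, 2), keepdims=True)
    std = features.std(axis=(1, 2), keepdims=True)
    normalized = (features - mean) / (std + eps)
    return scale[:, None, None] * normalized + bias[:, None, None]

rng = np.random.default_rng(0)
features = rng.normal(size=(256, 8, 8))   # 256 feature maps, as in the bottleneck
scale = rng.normal(size=256)
bias = rng.normal(size=256)
out = adain(features, scale, bias)
```

Note that `scale` and `bias` have one entry per feature map, so their size depends only on the channel count and not on how many labels or styles exist, which is the property the paragraph above relies on.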
It is important to mention that since the style and label dimensions may differ from the dimensions of the AdaIN parameters, we use a projection embedding to encode the style and label inputs to a fixed size suitable for AdaIN (Equation 3).
We remark that the DE does not require any training scheme; instead, it is inspired by language modeling methods [40, 13, 36, 41, 45] that use random initialization to map the input to an embedding space. In particular, we use a simple random embedding, i.e. a fully connected layer that maps the concatenation of style and labels to the AdaIN parameters. Our rationale is as follows: by always ensuring different inputs to the DE, we guarantee different normalization parameters, which means different fake images. We study the DE behaviour in more detail in Section 5.1.
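The fixed random projection can be illustrated with a short NumPy sketch. It assumes a 20-dimensional style (as stated in the implementation details), a hypothetical count of 8 labels, and one AdaIN layer per residual block with 256 channels; the class name `DomainEmbedding` and the weight scaling are choices made for this example only.

```python
import numpy as np

STYLE_DIM = 20        # style size stated in the implementation details
NUM_LABELS = 8        # hypothetical label count, for illustration only
NUM_ADAIN_LAYERS = 6  # assumption: one AdaIN layer per residual block
CHANNELS = 256        # feature maps in the bottleneck

class DomainEmbedding:
    """Fully connected layer with fixed random weights (never trained)."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = rng.standard_normal((out_dim, in_dim)) / np.sqrt(in_dim)
        self.bias = np.zeros(out_dim)

    def __call__(self, style, labels):
        joint = np.concatenate([style, labels])
        params = self.weight @ joint + self.bias
        # Axes: (AdaIN layer, scale-or-bias, channel)
        return params.reshape(NUM_ADAIN_LAYERS, 2, CHANNELS)

embed = DomainEmbedding(STYLE_DIM + NUM_LABELS, NUM_ADAIN_LAYERS * 2 * CHANNELS)
rng = np.random.default_rng(1)
labels = np.eye(NUM_LABELS)[3]  # one-hot target label
params_a = embed(rng.standard_normal(STYLE_DIM), labels)
params_b = embed(rng.standard_normal(STYLE_DIM), labels)
```

With the same label and two different style samples, `params_a` and `params_b` differ, so the normalization applied inside the generator differs as well.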
Discriminator ()
As previously stated, the discriminator has two outputs: source domain (src) and auxiliary classifier (cls). First, we use the idea of patch-GAN [26] to tell whether the source is fake or true based on a patch rather than a single number ($D_{src}$). Second, we have a binary cross entropy loss function for the conditional labels ($D_{cls}$). If continuous labels are used, then a regression objective loss should be applied. However, as we will discuss in Section 5.2, our approach is capable of generating continuous labels even if it was trained with discrete ones.
3.2.1 Training Framework
In order to approximate the mapping function in Equation 1, we split our general loss function into separate terms for clarity.
Adversarial Loss
We train the adversarial objective with the recently introduced averaged Relativistic Adversarial Loss (RGAN) [27] in its hinge version [42]. RGAN relies on the idea that the discriminator not only estimates whether images are real or fake, but also estimates the probability that the given real images are more realistic than the fake ones.
[TABLE]
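For orientation, the sketch below implements the commonly used relativistic average hinge formulation on raw discriminator scores. It is an assumption that this matches the exact equation used in the paper; the function names are hypothetical.

```python
import numpy as np

def ra_hinge_discriminator_loss(real_scores, fake_scores):
    """Relativistic average hinge loss for the discriminator."""
    real_relative = real_scores - fake_scores.mean()
    fake_relative = fake_scores - real_scores.mean()
    return float(np.maximum(0.0, 1.0 - real_relative).mean()
                 + np.maximum(0.0, 1.0 + fake_relative).mean())

def ra_hinge_generator_loss(real_scores, fake_scores):
    """Same relative scores with the roles of real and fake swapped."""
    real_relative = real_scores - fake_scores.mean()
    fake_relative = fake_scores - real_scores.mean()
    return float(np.maximum(0.0, 1.0 + real_relative).mean()
                 + np.maximum(0.0, 1.0 - fake_relative).mean())

# A discriminator that confidently separates real from fake
real_scores = np.array([2.0, 2.0])
fake_scores = np.array([-2.0, -2.0])
d_loss = ra_hinge_discriminator_loss(real_scores, fake_scores)
g_loss = ra_hinge_generator_loss(real_scores, fake_scores)
```

In this toy case the discriminator loss is zero while the generator loss is large, reflecting that each score is judged relative to the average score of the opposite set rather than in isolation.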
Conditional Loss
The adversarial loss does not include any regularization for the conditional labels, yet the generator must be able to produce both realistic and conditioned images. To solve this issue, we define the conditional loss as:
[TABLE]
Recovery Loss
In order to produce the fake image, we jointly input the target label and the target style. Therefore, the cycle consistency loss employed to recover the original image can be naively defined as:
[TABLE]
Note that the original style is an unknown parameter. Nonetheless, we assume that it is drawn from a known normal distribution, and therefore reformulate the reconstruction loss by using a different random style for the backward translation. We assume random styles during the whole training process. Thus, we compute the reconstruction or cycle consistency loss as:
[TABLE]
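The random-style cycle described above can be sketched as follows. The generator here is a hypothetical toy function, and the L1 distance is an assumption borrowed from CycleGAN-style reconstruction losses; only the structure (two independently sampled styles, forward then backward translation) follows the text.

```python
import numpy as np

STYLE_DIM = 20  # style size stated in the implementation details

def toy_generator(image, labels, style):
    """Hypothetical stand-in for the generator: shifts the image by amounts
    that depend on the target labels and on the style vector."""
    return image + 0.1 * labels.sum() + 0.01 * style.mean()

def cycle_loss(generator, image, source_labels, target_labels, rng):
    """L1 reconstruction error after a forward and a backward translation,
    each conditioned on an independently sampled random style."""
    style_forward = rng.standard_normal(STYLE_DIM)
    style_backward = rng.standard_normal(STYLE_DIM)  # stands in for the unknown original style
    fake = generator(image, target_labels, style_forward)
    recovered = generator(fake, source_labels, style_backward)
    return float(np.abs(image - recovered).mean())

rng = np.random.default_rng(0)
image = rng.uniform(-1.0, 1.0, size=(3, 8, 8))
source_labels = np.array([1.0, 0.0, 0.0])
target_labels = np.array([0.0, 1.0, 0.0])
loss = cycle_loss(toy_generator, image, source_labels, target_labels, rng)
```

Because the backward pass never sees the true source style, a low loss is only reachable if the reconstruction depends on the image content rather than on one specific style sample.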
Attention Loss
Until this point, there is no guarantee that the output of our generator will preserve background details, e.g., the underlying structure or the identity of a person. To solve this particular issue, we regularize our model with the unsupervised attention mechanism proposed by Pumarola et al. [46]. We add a new layer, parallel to the generator output, that works as the attention mask.
The attention loss encourages fake images to change only certain regions with respect to the real input, and it decomposes into the following terms:
[TABLE]
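To make the role of the mask concrete, the snippet below blends a generated image with the input using a single-channel mask. It assumes the GANimation-style convention in which a mask value of 1 keeps the input pixel and 0 takes the generated pixel; the convention used by SMIT may differ, and the regularization terms on the mask are not shown.

```python
import numpy as np

def apply_attention(real_image, color_output, attention_mask):
    """Blend the generator's color output with the input image.

    Assumed convention: mask value 1 keeps the input pixel,
    mask value 0 takes the generated pixel.
    """
    return attention_mask * real_image + (1.0 - attention_mask) * color_output

real = np.full((3, 4, 4), 0.5)      # input image
color = np.full((3, 4, 4), -0.5)    # raw generator output (Tanh range)
mask = np.zeros((1, 4, 4))          # single-channel mask (Sigmoid range)
mask[:, :2, :] = 1.0                # keep the top half of the input
blended = apply_attention(real, color, mask)
```

The single-channel mask broadcasts over the three color channels, so regions where the mask saturates toward 1 pass the input through unchanged.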
Identity Loss
To further stabilize the training framework, we regularize our model with the identity loss that is defined as follows:
[TABLE]
Overall Loss
We define our full objective function in Equation 9 as the weighted sum of the previous losses:
[TABLE]
Remarkably, our method does not require style regularization [23, 34] since we use a training framework that can easily bypass it.
4 Experimental Setup
We validate our method over several very different datasets and tasks, such as instance facial synthesis [38], emotion recognition [31], Yosemite summer↔winter [26], and edges-to-object generation [26].
In the supplementary material, we extend our qualitative results to painters [4], Alps seasons [4], RafD [31], BP4D [54], EmotionNet [16], and full CelebA [38] with 40 attributes.
4.1 Evaluation Metrics
Diverse Translation
The LPIPS metric [53] allows us to quantify the similarity between two different images. LPIPS computes the L2 distance between the deep features (e.g., from AlexNet or VGG) of a pair of images.
Multi-label Translation
Besides the LPIPS score, we also compute the Inception Score (IS) [48] that is a popular score for I2I problems. The IS employs an Inception Network [50] to classify fake images and thus rank them according to their scores with respect to the prior distribution. Additionally, we report the Conditional Inception Score (CIS) [23] that quantifies both high quality and diverse mapping.
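As a reference point, the standard Inception Score is the exponential of the average KL divergence between each image's predicted class distribution and the marginal distribution over all images. The sketch below computes it from a matrix of class probabilities; the function name and the `eps` smoothing are assumptions of this example, and the classifier itself is not included.

```python
import numpy as np

def inception_score(class_probs, eps=1e-12):
    """class_probs: array of shape (num_images, num_classes), rows sum to 1."""
    marginal = class_probs.mean(axis=0, keepdims=True)
    kl_per_image = (class_probs
                    * (np.log(class_probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl_per_image.mean()))

# Confident predictions spread evenly over four classes
confident = np.eye(4)
# Identical, uninformative predictions for every image
uninformative = np.full((4, 4), 0.25)

score_confident = inception_score(confident)
score_uninformative = inception_score(uninformative)
```

The score is bounded below by 1 (all predictions identical) and above by the number of classes (confident predictions covering every class equally).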
4.2 Evaluation Framework
Given the unique nature of our approach, we unfold the quantitative evaluation into two different schemes: multimodal evaluation, and multi-label evaluation.
Multimodal Evaluation
We directly use MUNIT [23] and DRIT [34] to compare our method in GAN-based disentangled representations. For fair comparison under this setting, we work within the same datasets Edges [26] and Yosemite [55]. To this end, we train MUNIT and DRIT and report the corresponding LPIPS over the whole test set.
We use the LPIPS score to measure the diversity of the generated images. As there is no standard evaluation framework for the diversity in GAN-based problems, we use a set of two metrics. First, as in MUNIT, we compute the diversity one-vs-all across the entire dataset (D), using the diversity in the real data as a reference. Then, we use one single fixed style to produce the cross-mapping in order to compute the diversity along the entire fake dataset. Second, as in DRIT, given a single image, we measure the partial diversity (PD) across different modalities (20 different styles) and report the average and standard deviation over each image, over the whole set.
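The two diversity measures can be sketched as below. This is a simplified reading of the protocol: D is taken as the mean distance over all pairs of outputs produced with one fixed style, and PD as the per-image mean pairwise distance over 20 sampled styles. The distance function is a hypothetical placeholder for LPIPS, and the generators are toy functions.

```python
import itertools
import numpy as np

def stand_in_distance(a, b):
    """Hypothetical placeholder for LPIPS: mean squared pixel difference."""
    return float(np.mean((a - b) ** 2))

def mean_pairwise_distance(images, dist):
    pairs = itertools.combinations(range(len(images)), 2)
    return float(np.mean([dist(images[i], images[j]) for i, j in pairs]))

def diversity(generate, inputs, fixed_style, dist):
    """D: distance across the whole fake set, all produced with one style."""
    fakes = [generate(x, fixed_style) for x in inputs]
    return mean_pairwise_distance(fakes, dist)

def partial_diversity(generate, inputs, dist, rng, num_styles=20, style_dim=20):
    """PD: per-image distance across styles, then mean and std over images."""
    scores = []
    for x in inputs:
        fakes = [generate(x, rng.standard_normal(style_dim)) for _ in range(num_styles)]
        scores.append(mean_pairwise_distance(fakes, dist))
    return float(np.mean(scores)), float(np.std(scores))

def stochastic_generator(image, style):
    return image + 0.1 * style.mean()   # toy: output depends on the style

def deterministic_generator(image, style):
    return image                        # toy: style is ignored

rng = np.random.default_rng(0)
inputs = [rng.uniform(-1.0, 1.0, size=(3, 8, 8)) for _ in range(4)]
pd_mean, pd_std = partial_diversity(stochastic_generator, inputs, stand_in_distance, rng)
pd_deterministic, _ = partial_diversity(deterministic_generator, inputs, stand_in_distance, rng)
d_score = diversity(stochastic_generator, inputs, np.zeros(20), stand_in_distance)
```

A generator that ignores the style yields a PD of exactly zero, which is why deterministic methods have no PD entry to report.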
Multi-label Evaluation
Additionally, for purely multi-label I2I methods, we train an Inception network [50] on a RafD train set (90%) and report the IS and CIS over the remaining test set (10%). We retrain StarGAN [12] and GANimation [46] under exactly the same settings in order to make a fair comparison.
4.3 Implementation Details
We use an ensemble of three different convolutional networks: Generator, Discriminator, and a Domain Embedding (DE).
Similar to previous methods [23, 34], we assume the style to be drawn from a prior Gaussian distribution with zero mean and identity covariance, namely $\mathcal{N}(0, I)$. Therefore, the DE takes this 20-dimensional style vector and the one-hot encoded target domain vector as inputs to produce the corresponding number of AdaIN parameters.
We provide a more detailed description of the architecture of our networks and training details in the supplementary material.
5 Results
We quantitatively and qualitatively demonstrate the effectiveness of SMIT in several settings. First, we perform ablation experiments, then we show qualitative results over different datasets, and finally we perform an extensive quantitative evaluation and compare our results against the state-of-the-art.
5.1 Ablation Study
We establish different baselines that define the main components of our framework: DE learning, removing the style randomness, adding style regularization, and removing the attention mechanism. We perform a qualitative and quantitative comparison for each of them, and we report our findings in Figure 3 and Table 5.1, respectively.
DE learning
Studying the DE parameters is one of our main interests, as the DE is the only controller between the style and labels and the mapped image. We observe that the generator can easily fall into mode collapse if the DE weights are learned, thus producing almost the same images for different styles. In order to overcome this problem, we analyze the DE contribution to the general system with either learned or fixed random parameters. As Figure 3 (SMIT_DE_learning) shows, learning the DE parameters leads to full mode collapse, since the style has a negligible impact on the AdaIN generator parameters. This behaviour arises because the gradients that come from the auxiliary classifier force the domain embedding to produce stable outputs, and therefore the same output, owing to the lack of specialized, per-domain style regularization. Conversely, by establishing fixed weights on the DE, we guarantee diversity, i.e., from Equation 2 we observe that different scale and bias values ensure different behaviour in the normalization, hence different outputs.
References
- [1] Faceapp. http://www.faceapp.com. 2018.
- [2] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. Van Gool. Extreme learned image compression with gans. In CVPR Workshops, 2018.
- [3] A. Almahairi, S. Rajeswar, A. Sordoni, P. Bachman, and A. Courville. Augmented cyclegan: Learning many-to-many mappings from unpaired data. In ICML, 2018.
- [4] A. Anoosheh, E. Agustsson, R. Timofte, and L. Van Gool. Combogan: Unrestrained scalability for image domain translation. In CVPR Workshops, 2018.
- [5] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- [6] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem. Finding tiny faces in the wild with generative adversarial network. In CVPR, 2018.
- [7] A. Bansal, S. Ma, D. Ramanan, and Y. Sheikh. Recycle-gan: Unsupervised video retargeting. In ECCV, 2018.
- [8] A. Bansal, Y. Sheikh, and D. Ramanan. Pixelnn: Example-based image synthesis. arXiv preprint arXiv:1708.05349, 2017.
