SMIT: Stochastic Multi-Label Image-to-Image Translation

Andrés Romero; Pablo Arbeláez; Luc Van Gool; Radu Timofte

arXiv:1812.03704·cs.CV·September 6, 2019

TL;DR

SMIT introduces a unified framework for multi-label, multimodal image translation that handles unpaired datasets, multiple attributes, and style diversity using a single generator and domain embeddings.

Contribution

It presents a novel joint approach combining diversity and multi-mapping in image translation with a single generator and domain embeddings, addressing multiple challenges simultaneously.

Findings

01 Outperforms state-of-the-art in multi-label and multimodal translation tasks.

02 Effectively handles continuous style and label interpolation.

03 Generalizes well across different datasets and scenarios.

Abstract

Cross-domain mapping has been a very active topic in recent years. Given one image, its main purpose is to translate it to the desired target domain, or multiple domains in the case of multiple labels. This problem is highly challenging due to three main reasons: (i) unpaired datasets, (ii) multiple attributes, and (iii) the multimodality (e.g., style) associated with the translation. Most of the existing state-of-the-art has focused only on two reasons, i.e. either on (i) and (ii), or (i) and (iii). In this work, we propose a joint framework (i, ii, iii) of diversity and multi-mapping image-to-image translations, using a single generator to conditionally produce countless and unique fake images that hold the underlying characteristics of the source image. Our system does not use style regularization, instead, it uses an embedding representation that we call domain embedding for both…

Tables (15)

Table 1: Feature comparison with state-of-the-art approaches in I2I translation. SMIT uses a single generator trained with unpaired data to produce disentangled representations of a multi-targeted domain.

	CycleGAN [55]	BiCycleGAN [56]	StarGAN [12]	MUNIT & alike [23, 3, 39]	DRIT [34]	GANimation [46]	SMIT (ours)
Unpaired Training	✓		✓	✓	✓	✓	✓
Multimodal Generation		✓		✓	✓		✓
Multiple Attributes			✓			✓	✓
One Single Generator			✓			✓	✓
Fine-grained Transformation			✓			✓	✓
Continuous Label Interpolation						✓	✓
Style Transformation				✓	✓		✓
Style Interpolation				✓	✓		✓
Attention Mechanism						✓	✓

Table 2: Ablation quantitative evaluation. We report the diversity (D) and the partial diversity (PD) for every ablation study in our method.

	Yosemite [26]
	D	PD
SMIT_{no_style}	0.412 $\pm$ 0.046	-
SMIT_{DE_learning}	0.413 $\pm$ 0.044	0.004 $\pm$ 0.003
SMIT_{no_attention}	0.406 $\pm$ 0.041	0.105 $\pm$ 0.071
SMIT_{style_encoder}	0.418 $\pm$ 0.043	0.133 $\pm$ 0.063
SMIT	0.419 $\pm$ 0.048	0.145 $\pm$ 0.072

Table 3: Multimodal quantitative evaluation. We report the LPIPS score to compare the diversity (D) and partial diversity (PD) with respect to the multimodal approaches. Better results are boldfaced according to their significant values.

	Edges2Shoes [26]		Edges2Handbags [26]		Yosemite [26]		# Parameters
	D	PD	D	PD	D	PD	(Generator)
CycleGAN [55]	0.272 $\pm$ 0.048	-	0.293 $\pm$ 0.081	-	0.272 $\pm$ 0.048	-	2x11.4M
DRIT [34]	0.237 $\pm$ 0.149	0.028 $\pm$ 0.030	0.296 $\pm$ 0.181	0.056 $\pm$ 0.060	0.398 $\pm$ 0.038	0.126 $\pm$ 0.019	2x21.3M
MUNIT [23]	0.295 $\pm$ 0.051	0.077 $\pm$ 0.057	0.365 $\pm$ 0.052	0.123 $\pm$ 0.067	0.335 $\pm$ 0.045	0.208 $\pm$ 0.034	2x15.0M
SMIT (ours)	0.303 $\pm$ 0.058	0.072 $\pm$ 0.056	0.367 $\pm$ 0.048	0.096 $\pm$ 0.072	0.437 $\pm$ 0.041	0.145 $\pm$ 0.072	8.4M
Real Data	0.313 $\pm$ 0.052	-	0.374 $\pm$ 0.051	-	0.447 $\pm$ 0.049	-	-

Table 4: Multi-label quantitative evaluation. We report the results for Inception Score (IS), Conditional Inception Score (CIS), and LPIPS diversity metric (D and PD), for multi-label frameworks.

	RafD [31]
	CIS	IS	D	PD
StarGAN [12]	1.00 $\pm$ 0.00	1.66 $\pm$ 0.38	0.15 $\pm$ 0.01	-
GANimation [46]	1.00 $\pm$ 0.00	1.51 $\pm$ 0.33	0.16 $\pm$ 0.01	-
SMIT (ours)	1.25 $\pm$ 0.06	2.51 $\pm$ 0.70	0.17 $\pm$ 0.01	0.004 $\pm$ 0.001
Real Data	-	1.18 $\pm$ 0.18	0.16 $\pm$ 0.01	-

Table 5: SMIT Generator network architecture.

Part	Input $\to$ Output Shape	Layer Information
Down-sampling	$(256, 256, 3) \to (256, 256, 32)$	Conv2d(dim=32, kernel=7, stride=1, padding=3), IN, ReLU
	$(256, 256, 32) \to (128, 128, 64)$	Conv2d(64, 4, 2, 1), IN, ReLU
	$(128, 128, 64) \to (64, 64, 128)$	Conv2d(128, 4, 2, 1), IN, ReLU
	$(64, 64, 128) \to (32, 32, 256)$	Conv2d(256, 4, 2, 1), IN, ReLU
Bottleneck	$(32, 32, 256) \to (32, 32, 256)$	Residual Block: Conv2d(256, 3, 1, 1), AdaIN, ReLU
	$(32, 32, 256) \to (32, 32, 256)$	Residual Block: Conv2d(256, 3, 1, 1), AdaIN, ReLU
	$(32, 32, 256) \to (32, 32, 256)$	Residual Block: Conv2d(256, 3, 1, 1), AdaIN, ReLU
	$(32, 32, 256) \to (32, 32, 256)$	Residual Block: Conv2d(256, 3, 1, 1), AdaIN, ReLU
	$(32, 32, 256) \to (32, 32, 256)$	Residual Block: Conv2d(256, 3, 1, 1), AdaIN, ReLU
	$(32, 32, 256) \to (32, 32, 256)$	Residual Block: Conv2d(256, 3, 1, 1), AdaIN, ReLU
Up-sampling	$(32, 32, 256) \to (64, 64, 128)$	Nearest Upsampling (2x), Conv2d(128, 3, 1, 1), LN, ReLU
	$(64, 64, 128) \to (128, 128, 64)$	Nearest Upsampling (2x), Conv2d(64, 3, 1, 1), LN, ReLU
	$(128, 128, 64) \to (256, 256, 32)$	Nearest Upsampling (2x), Conv2d(32, 3, 1, 1), LN, ReLU
Fake Output ( $𝒳_{f}$ )	$(256, 256, 32) \to (256, 256, 3)$	Conv2d(3, 7, 1, 3), None, Tanh
Attention mask ( $ℳ$ )	$(256, 256, 32) \to (256, 256, 1)$	Conv2d(1, 7, 1, 3), None, Sigmoid

Table 6: SMIT Domain Embedding network architecture.

Layer	Input $\to$ Output Shape	Layer Information
Embedding Projection	$(20 + ℕ_{d}) \to (6144)$	FullyConnected(dim= $6144$ )

Table 7: SMIT Discriminator network architecture.

Layer	Input $\to$ Output Shape	Layer Information
Input Layer	$(256, 256, 3) \to (128, 128, 32)$	Conv2d(dim=32, kernel=4, stride=2, padding=1), SN, LReLU
Hidden Layer	$(128, 128, 32) \to (64, 64, 64)$	Conv2d(64, 4, 2, 1), SN, LReLU
Hidden Layer	$(64, 64, 64) \to (32, 32, 128)$	Conv2d(128, 4, 2, 1), SN, LReLU
Hidden Layer	$(32, 32, 128) \to (16, 16, 256)$	Conv2d(256, 4, 2, 1), SN, LReLU
Hidden Layer	$(16, 16, 256) \to (8, 8, 512)$	Conv2d(512, 4, 2, 1), SN, LReLU
Hidden Layer	$(8, 8, 512) \to (4, 4, 1024)$	Conv2d(1024, 4, 2, 1), SN, LReLU
Hidden Layer	$(4, 4, 1024) \to (2, 2, 2048)$	Conv2d(2048, 4, 2, 1), SN, LReLU
Output Layer ( $𝔻_{s r c}$ )	$(2, 2, 2048) \to (2, 2, 1)$	Conv2d(1, 3, 1, 1)
Output Layer ( $𝔻_{c l s}$ )	$(2, 2, 2048) \to (1, 1, ℕ_{d})$	Conv2d( $ℕ_{d}, 2, 1, 0$ )
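
The spatial sizes listed in Tables 5 and 7 can be checked with the standard convolution output-size formula, out = floor((in + 2 * padding - kernel) / stride) + 1. The short Python sketch below does this for the generator down-sampling path and the discriminator; the helper name `conv_out` is ours and is not taken from the SMIT code base.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a 2D convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

# Generator down-sampling path (Table 5): 256 -> 256 -> 128 -> 64 -> 32
gen_sizes = [256]
for kernel, stride, padding in [(7, 1, 3), (4, 2, 1), (4, 2, 1), (4, 2, 1)]:
    gen_sizes.append(conv_out(gen_sizes[-1], kernel, stride, padding))

# Discriminator trunk (Table 7): seven stride-2 convolutions, 256 -> 2
disc_sizes = [256]
for _ in range(7):
    disc_sizes.append(conv_out(disc_sizes[-1], 4, 2, 1))

src_size = conv_out(disc_sizes[-1], 3, 1, 1)  # D_src head keeps the 2x2 patch grid
cls_size = conv_out(disc_sizes[-1], 2, 1, 0)  # D_cls head collapses to 1x1

print(gen_sizes, disc_sizes, src_size, cls_size)
```

The 2x2 output of the source head is what makes the real/fake decision patch-based rather than a single scalar.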

Table 8: Multimodal quantitative evaluation for edges2shoes. We report the LPIPS score to compare the diversity (D) and partial diversity (PD) for each domain independently, in comparison with multimodal frameworks. We retrain CycleGAN, DRIT and MUNIT for these results.

	Edges2Shoes
	Edges		Shoes
	D	PD	D	PD
CycleGAN	0.269 $\pm$ 0.046	-	0.275 $\pm$ 0.050	-
DRIT	0.000 $\pm$ 0.000	0.000 $\pm$ 0.000	0.243 $\pm$ 0.052	0.056 $\pm$ 0.017
MUNIT	0.269 $\pm$ 0.049	0.027 $\pm$ 0.005	0.263 $\pm$ 0.049	0.126 $\pm$ 0.039
SMIT (ours)	0.274 $\pm$ 0.046	0.020 $\pm$ 0.006	0.261 $\pm$ 0.060	0.123 $\pm$ 0.029
Real Data	0.274 $\pm$ 0.046	-	0.293 $\pm$ 0.051	-

Table 9: Multimodal quantitative evaluation for edges2handbags. We report the LPIPS score to compare the diversity (D) and partial diversity (PD) for each domain independently, in comparison with multimodal frameworks. We retrain CycleGAN, DRIT and MUNIT for these results.

	Edges2Handbags
	Edges		Handbags
	D	PD	D	PD
CycleGAN	0.225 $\pm$ 0.043	-	0.361 $\pm$ 0.045	-
DRIT	0.000 $\pm$ 0.000	0.000 $\pm$ 0.000	0.344 $\pm$ 0.061	0.112 $\pm$ 0.032
MUNIT	0.352 $\pm$ 0.045	0.063 $\pm$ 0.016	0.334 $\pm$ 0.052	0.183 $\pm$ 0.039
SMIT (ours)	0.373 $\pm$ 0.041	0.029 $\pm$ 0.010	0.346 $\pm$ 0.048	0.164 $\pm$ 0.035
Real Data	0.346 $\pm$ 0.045	-	0.370 $\pm$ 0.053	-

Table 10: Multimodal quantitative evaluation for edges2objects. We report the LPIPS score to compare the diversity (D) and partial diversity (PD) for each domain independently, in comparison with multimodal frameworks. Due to the multi-label nature, SMIT is the only one that is suitable for this task.

	Edges2Objects
	Edges Shoes		Shoes		Edges Handbags		Handbags
	D	PD	D	PD	D	PD	D	PD
CycleGAN	-	-	-	-	-	-	-	-
DRIT	-	-	-	-	-	-	-	-
MUNIT	-	-	-	-	-	-	-	-
SMIT (ours)	0.130 $\pm$ 0.104	0.055 $\pm$ 0.024	0.286 $\pm$ 0.07	0.168 $\pm$ 0.028	0.279 $\pm$ 0.045	0.012 $\pm$ 0.008	0.304 $\pm$ 0.052	0.233 $\pm$ 0.060
Real Data	0.274 $\pm$ 0.046	-	0.293 $\pm$ 0.051	-	0.346 $\pm$ 0.045	-	0.370 $\pm$ 0.053	-

Table 11: Multimodal quantitative evaluation for Yosemite. We report the LPIPS score to compare the diversity (D) and partial diversity (PD) for each domain independently, in comparison with multimodal frameworks. We retrain CycleGAN, DRIT and MUNIT for these results.

	Yosemite
	Summer		Winter
	D	PD	D	PD
CycleGAN	0.408 $\pm$ 0.037	-	0.406 $\pm$ 0.041	-
DRIT	0.405 $\pm$ 0.033	0.120 $\pm$ 0.018	0.395 $\pm$ 0.040	0.131 $\pm$ 0.020
MUNIT	0.372 $\pm$ 0.034	0.212 $\pm$ 0.029	0.313 $\pm$ 0.035	0.204 $\pm$ 0.037
SMIT (ours)	0.378 $\pm$ 0.048	0.167 $\pm$ 0.070	0.410 $\pm$ 0.049	0.129 $\pm$ 0.069
Real Data	0.444 $\pm$ 0.055	-	0.444 $\pm$ 0.040	-

Table 12: Multi-label quantitative evaluation for RafD. We report the Conditional Inception Score (CIS) for each domain independently, in comparison with multi-label frameworks. We retrain StarGAN and GANimation for these results.

	RafD
	Conditional Inception Score (CIS)
	Neutral	Anger	Contempt	Disgust	Fear	Happy	Sad	Surprise
StarGAN	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000
GANimation	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000
SMIT (ours)	1.201	1.187	1.197	1.237	1.329	1.373	1.249	1.201

Table 13: Multi-label quantitative evaluation for RafD. We report the Inception Score (IS) for each domain independently, in comparison with multi-label frameworks. We retrain StarGAN and GANimation for these results.

	RafD
	Inception Score (IS)
	Neutral	Anger	Contempt	Disgust	Fear	Happy	Sad	Surprise
StarGAN	2.039	1.407	2.194	1.081	1.748	1.483	2.060	1.275
GANimation	1.559	1.320	2.024	1.115	1.427	1.698	1.888	1.033
SMIT (ours)	3.502	2.246	3.441	1.598	2.451	2.327	3.009	1.527
Real Data	1.120	1.439	1.401	1.001	1.360	1.001	1.126	1.007

Table 14: Multi-label quantitative evaluation for RafD. We report the LPIPS diversity metric (D) for each domain independently, in comparison with multi-label frameworks. We retrain StarGAN and GANimation for these results.

	RafD
	Diversity (D)
	Neutral	Anger	Contempt	Disgust	Fear	Happy	Sad	Surprise
StarGAN	0.157	0.154	0.152	0.152	0.152	0.150	0.149	0.150
GANimation	0.156	0.156	0.154	0.156	0.156	0.157	0.159	0.160
SMIT (ours)	0.164	0.161	0.162	0.163	0.163	0.164	0.165	0.170
Real Data	0.167	0.165	0.166	0.166	0.166	0.167	0.167	0.167

Table 15: Multi-label quantitative evaluation for RafD. We report the LPIPS partial diversity metric (PD) for each domain independently, in comparison with multi-label frameworks. We retrain StarGAN and GANimation for these results.

	RafD
	Partial Diversity (PD)
	Neutral	Anger	Contempt	Disgust	Fear	Happy	Sad	Surprise
StarGAN	-	-	-	-	-	-	-	-
GANimation	-	-	-	-	-	-	-	-
SMIT (ours)	0.003	0.004	0.003	0.004	0.004	0.004	0.003	0.005

Equations

$$\mathbb{G}(\mathcal{X}_{r}, y_{f}, s_{f}) \rightarrow \mathcal{X}_{f} \in \mathbb{R}^{H \times W \times 3}$$

$$AdaIN(x, z) = z_{w}\,\frac{x - \mu(x)}{\sigma(x)} + z_{b}$$

$$z = DE(y, s)$$

$$\mathcal{L}_{D} = \mathbb{D}_{src}(\mathcal{X}_{r}) - \|\mathbb{D}_{src}(\mathcal{X}_{f})\|_{1}$$

$$\mathcal{L}_{G} = \mathbb{D}_{src}(\mathcal{X}_{f}) - \|\mathbb{D}_{src}(\mathcal{X}_{r})\|_{1}$$

$$\mathcal{L}_{adv} = \mathcal{L}_{D} + \mathcal{L}_{G}$$

$$\mathcal{L}_{cls} = \mathbb{D}_{cls}(\mathcal{X})\log(y) + (1 - \mathbb{D}_{cls}(\mathcal{X}))\log(1 - y)$$

$$\mathcal{X}_{r} \approx \mathcal{X}_{rec} = \mathbb{G}(\mathbb{G}(\mathcal{X}_{r}, y_{f}, s_{f}), y_{r}, s_{r})$$

$$\mathcal{X}_{rec} = \mathbb{G}(\mathbb{G}(\mathcal{X}_{r}, y_{f}, s_{f}), y_{r}, s_{f}^{\prime})$$

$$\mathcal{L}_{rec} = \|\mathcal{X}_{r} - \mathcal{X}_{rec}\|_{1}$$

$$[\mathcal{X}_{f} \in \mathbb{R}^{H \times W \times 3},\ \mathcal{M} \in \mathbb{R}^{H \times W}] = \mathbb{G}(\mathcal{X}_{r}, y_{f}, s_{f})$$

$$\mathcal{X}_{f} = \mathcal{M} \cdot \mathcal{X}_{r} + (1 - \mathcal{M}) \cdot \mathcal{X}_{f}$$

$$\mathcal{L}_{attn} = \|\mathcal{M}\|_{1}$$

$$\mathcal{L}_{idt} = \|\mathcal{X}_{r} - \mathbb{G}(\mathcal{X}_{r}, y_{r}, s_{f}^{\prime\prime})\|_{1}$$

$$\mathcal{L} = \lambda_{adv}\mathcal{L}_{adv} + \lambda_{cls}\mathcal{L}_{cls} + \lambda_{rec}\mathcal{L}_{rec} + \lambda_{attn}\mathcal{L}_{attn} + \lambda_{idt}\mathcal{L}_{idt}$$

Code & Models

Repositories

BCV-Uniandes/SMIT
PyTorch (official)

Full text

SMIT: Stochastic Multi-Label Image-to-Image Translation

Andrés Romero

BCV Lab

Universidad de Los Andes

Pablo Arbeláez

BCV Lab

Universidad de Los Andes

Luc Van Gool

ETH Zürich

KU Leuven

Radu Timofte

CV Lab

ETH Zürich

Abstract

Cross-domain mapping has been a very active topic in recent years. Given one image, its main purpose is to translate it to the desired target domain, or multiple domains in the case of multiple labels. This problem is highly challenging for three main reasons: (i) unpaired datasets, (ii) multiple attributes, and (iii) the multimodality (e.g., style) associated with the translation. Most existing state-of-the-art methods address only two of these challenges, i.e., either (i) and (ii), or (i) and (iii). In this work, we propose a joint framework (i, ii, iii) of diversity and multi-mapping image-to-image translations, using a single generator to conditionally produce countless and unique fake images that hold the underlying characteristics of the source image. Our system does not use style regularization; instead, it uses an embedding representation that we call domain embedding for both domain and style. Extensive experiments over different datasets demonstrate the effectiveness of our proposed approach in comparison with the state-of-the-art in both multi-label and multimodal problems. Additionally, our method is able to generalize under different scenarios: continuous style interpolation, continuous label interpolation, and fine-grained mapping. Code and pretrained models are available at https://github.com/BCV-Uniandes/SMIT.

1 Introduction

The ability of humans to easily imagine what a black-haired person would look like if they were blond or wore a different type of eyeglasses, or to imagine a winter scene as summer, is formulated as the image-to-image (I2I) translation problem in the computer vision community. Since the recent introduction of Generative Adversarial Networks (GANs) [19], a plethora of problems such as video analysis [51, 7], super resolution [33, 9], semantic synthesis [26, 10], photo enhancement [24, 25], photo editing [49, 14], and most recently domain adaptation [21, 43] have been addressed as I2I translation problems.

Initially, translating from one domain into another required paired datasets that exactly matched both domains [26], e.g., the edges $\leftrightarrow$ shoes or edges $\leftrightarrow$ handbags datasets. However, this approach is impractical because the full representation of the cross-domain mapping is, in most cases, intractable. Existing techniques try to perform deterministic I2I translation with unpaired images to map from one domain into another (one-to-one) [55, 4, 37, 25], or into multiple domains (one-to-many) [12, 46, 20]. Nevertheless, many problems are fundamentally stochastic, as there are countless mappings from one domain to another, e.g., a day $\leftrightarrow$ night or cat $\leftrightarrow$ dog translation.

Recent techniques [34, 23, 39] have successfully addressed the multimodal representation for one-to-one domain translation. These methods build on an observation from traditional I2I approaches [55, 56]: the generator tends to ignore injected noise. As a consequence, these techniques frame disentangled representation as a style transfer problem, with a shared content space representation and a style encoder network.

In this paper, we propose Stochastic Multi-Label Image-to-Image Translation (SMIT), a novel and robust framework that handles multiple labels and diversity, and requires neither style nor content regularization. Moreover, we build our entire approach on a single generator that does not ignore the noise perturbation, i.e., for different levels of noise our method produces different styles with the underlying characteristics and structure of the target domain. (Hereafter, we refer to domains as the number of labels per dataset, and to style as the diversity induced by noise.) As illustrated in Figure 1, SMIT learns a full distribution for each attribute, so it can perform diverse translation for different fine-grained or broader attributes. It is important to remark that, in contrast to [12, 46, 30], the trainable parameters in the SMIT generator are not label-dependent; that is, there is a negligible difference in either computational time or memory consumption when learning as many as 40 attributes instead of just 2 labels. Figure 2 presents an overview of our model. We radically depart from mainstream approaches [12, 46, 30], where the target domain is inserted through spatial concatenation. Instead, we indirectly inject the style and the target labels through Adaptive Instance Normalization (AdaIN) [22] layers in the generator, and the discriminator aims at recovering only the labels, i.e., we stress the importance of using no style regularization.

We perform a comprehensive quantitative evaluation of SMIT either for disentanglement or multiple domain I2I problems, demonstrating the advantages of our method in comparison with existing state-of-the-art models. We also show qualitative results on several datasets that validate the effectiveness of our approach under varied and challenging settings.

More precisely, our main contribution is a single end-to-end system with a domain-agnostic generator capable of performing style transformation, multi-label mapping, style interpolation, and continuous label interpolation with no need for style regularization. For reproducibility, we plan to release our source code and trained models.

2 Related Work

Generative Adversarial Networks (GANs) [19] have proven to be a powerful approach to learn statistical data distributions. GANs rely on game theory where there are two networks (discriminator and generator) optimizing a Minimax function, a training scheme also known as adversarial training. The discriminator learns to distinguish real images from fake ones produced by the generator, and the generator learns to fool the discriminator by producing realistic fake images. Since their introduction, GANs have provided remarkable results in several computer vision problems, such as image generation [47, 11, 29], image translation [26, 55, 3, 37], video translation [51, 7] and resolution enhancement [6, 33, 2]. As our approach lies in the domain of image-to-image translation, it is the focus of our related work review.

Conditional GANs (cGANs)

In vanilla GANs [19], the information regarding the domain is unknown. Conversely, in conditional GANs (cGANs) [44], the discriminator not only distinguishes between real and fake, but it also trains an auxiliary classifier for the conditional data distribution. cGANs have been applied in image-to-image translation problems for semantic layouts [26, 10], super resolution [33], photo editing [49], and for multi-target domains [12, 30, 46]. While traditional cGANs exploit the underlying conditional distribution of the data, they are constrained to produce deterministic outputs, i.e., given an input and a target label, the output is always the same. In comparison, our approach introduces style randomness in the generation process.

Image-to-Image Translation (I2I)

Isola et al. [26] introduced a framework in which they trained cGANs using paired datasets. This work led to a new set of previously unexplored I2I problems. Based on these findings, Zhu et al. [55] extended the framework by introducing the cycle-consistency loss, which made it possible to perform cross-domain mapping using unpaired datasets. Although CycleGAN [55] is currently one of the most common backbones for I2I models and frameworks, it is constrained to one-to-one domain translation, hence it needs one generator per domain. In contrast, our method uses a single generator regardless of the number of domains.

Other works [12, 46] extended the cycle-consistency insight to cope with multiple domains using a single generator. These methods feed the label as independent features into the first layer of the generator, hence constraining the generator weights to restricted applications. Similarly, additional methods [30, 20] tackled the multi-label mapping problem from a VAE-GAN [32] perspective. Our approach neither uses a variational autoencoder representation nor depends on label weights, since the generator always has the same number of parameters regardless of the application.

Disentangled Representations

A recurrent limitation in traditional I2I methods is their deterministic output. In image generation problems [47, 11, 28], disentangled representations are achieved by injecting random noise in the generator. Nevertheless, this idea cannot be used on the seminal CycleGAN, as this framework learns to ignore the noise vector due to the lack of regularization [55].

Recently, there have been efforts [10, 56, 8] to produce diverse representations from a single input. For instance, BiCycleGAN [56] bypassed the regularization issues of CycleGAN by including a random noise vector in the training scheme, thus generating images of higher quality than CycleGAN. However, this approach requires paired data to train, which makes it infeasible to scale in real-world scenarios.

Furthermore, generating multimodal images can also be studied as a problem of style transfer [17, 18] between two images. Inspired by the work of Gatys et al. [17], recent approaches [23, 39, 34] split the generator encoder into a two-stream content and style encoder, where the content stream extracts the underlying structure, shape and main information to be preserved in the image, and the style stream captures the rendering attributes to be transferred. These disentangled representations are similar in spirit to the CycleGAN cycle-consistency adversarial loss, since they perform a cross-domain mapping for the style and content space. Consequently, it is difficult to perform fine-grained translations. In comparison, our proposed approach does not suffer in this regard, since we constrain neither the content nor the style distributions. Moreover, as the experiments will show, SMIT is suitable for both coarser translations and subtle local appearances, e.g., art in-painting or facial expressions, respectively.

Continuous Interpolation

On the one hand, Pumarola et al. [46] introduced a cGAN framework that takes as input continuous rather than discrete labels. This approach enables the generation of examples with continuous labels at inference time; however, it does not handle diversity for the same input. On the other hand, for binary problems, Lee et al. [34] and Huang et al. [23] performed continuous interpolation between two styles in order to produce a pseudo-animated style transfer with images that belong to the same domain. Our work uses both target and style continuous interpolation.

Table 1 summarizes our main differences with respect to the literature for either multi-label or multimodal translation. SMIT has richer capabilities than those of existing methods, as we perform fine-grained local transformation, style transformation, continuous style interpolation, continuous label interpolation, and multi-label transfer using one single generator.

3 Stochastic Multi-Label Image-to-Image Translation (SMIT)

Our final goal is to generate multi-attribute images with different styles using a single generator. As illustrated in Figure 2, our method is an ensemble of three different networks: a generator, a discriminator, and a domain embedding (DE). The generator takes the source image as input and translates it. The discriminator does not only differentiate between real and fake samples, but it also approximates the output distribution of the real target by means of an auxiliary classifier. Finally, SMIT uses the DE to merge both target style and target labels into the generator.

3.1 Problem Formulation

Let $\mathcal{X}_{r}\in\mathbb{R}^{H\times{W}\times{3}}$ be the real image. $\mathcal{X}_{r}$ is encoded by a set of $N$ discrete or continuous labels $y_{r}\in\mathbb{R}^{N}$ . Additionally, for each possible $\mathcal{X}_{r}$ , there is an unknown style distribution $s_{r}\in\mathbb{R}^{S}$ . Given a target label $y_{f}$ , and a target style $s_{f}$ , we want to learn a mapping function $\mathbb{G}$ to produce a fake image $\mathcal{X}_{f}$ , without having access to the joint distribution $p(\mathcal{X}_{r},\mathcal{X}_{f})$ :

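$$\mathbb{G}(\mathcal{X}_{r}, y_{f}, s_{f}) \rightarrow \mathcal{X}_{f} \in \mathbb{R}^{H \times W \times 3} \qquad (1)$$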

As is common in cGANs [12, 46, 11, 47], we have a discriminator $\mathbb{D}$ that outputs the source domain probability, i.e., real or fake, and a classification/regression estimator, namely, $\mathbb{D}(\mathcal{X}_{f})\rightarrow{\{0,y_{f}\}}$ and $\mathbb{D}(\mathcal{X}_{r})\rightarrow{\{1,y_{r}\}}$ .

3.2 Model

Generator ( $\mathbb{G}$ )

We build upon the CycleGAN generator [55]. It follows an encoder-decoder architecture, which consists of down-sampling layers, residual blocks, and up-sampling layers. Importantly, we use Instance Normalization (IN) [15, 52], Adaptive Instance Normalization (AdaIN) [22], and Layer Normalization (LN) [5] for the three stages, respectively. We use IN only in the first stage and not in the up-sampling stage because IN there would introduce undesirable changes to the global mean and variance that AdaIN modifies in the residual layers.

Domain Embedding (DE)

We indirectly input the target attribute and the style randomness through AdaIN [22] weights. AdaIN normalization is computed from Equation 2, where $x$ is the input and $z$ are the adaptive parameters.

$$AdaIN(x, z) = z_{w}\,\frac{x - \mu(x)}{\sigma(x)} + z_{b} \qquad (2)$$

As the AdaIN parameters depend entirely on the number of feature maps of the input $x$ , they are agnostic to both style and label domains, which makes the generator entirely label and style independent. This key property makes SMIT highly suitable for transfer learning, addressing a drawback of cGANs in real-world scenarios.
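
As a minimal illustration of the AdaIN operation described above, the following Python sketch normalizes a single feature map and then applies an externally supplied scale and shift. It is a toy version on a flat list of activations, not the SMIT implementation; the small `eps` term is our addition for numerical stability and does not appear in the paper's formula.

```python
import math

def adain(x, z_w, z_b, eps=1e-5):
    """Normalize one feature map x (a flat list of activations) to zero mean
    and unit variance, then scale by z_w and shift by z_b."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    std = math.sqrt(var + eps)  # eps is an assumption, added to avoid division by zero
    return [z_w * (v - mean) / std + z_b for v in x]

feature_map = [0.0, 1.0, 2.0, 3.0]
out_a = adain(feature_map, z_w=1.0, z_b=0.0)
out_b = adain(feature_map, z_w=2.0, z_b=0.5)  # different z gives different statistics

print(out_a)
print(out_b)
```

Because one scale and one shift are needed per feature map, the number of AdaIN parameters is fixed by the layer width alone, which is why the generator itself does not depend on the number of labels or on the style dimension.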

It is important to mention that since the style and label dimensions may differ from the $z$ dimensions, we use a projection embedding representation to encode style and label inputs to a fixed size suitable for AdaIN (Equation 3): $z = DE(y, s)$.

We remark that the DE does not require any training scheme; instead, it is inspired by Language Modeling methods [40, 13, 36, 41, 45] that use random initialization to map the input to an embedding space. In particular, we use a simple random embedding, i.e., a fully connected layer that maps the concatenation of style and labels to the AdaIN parameters. Our rationale is as follows: by always ensuring different $z$ , we guarantee different normalization parameters, which means different fake images. We study the DE behaviour in more detail in Section 5.1.
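
The idea of a fixed random projection can be sketched in a few lines of Python. This is an illustration under assumptions, not the released code: the absence of a bias term, the Gaussian initialization, and the split of the output into a scale half and a shift half are our choices, and `num_features` is a toy value. The style dimension of 20 follows Table 6.

```python
import random

def make_domain_embedding(in_dim, out_dim, seed=0):
    """Build a fixed, randomly initialized linear map (no bias, never trained)."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def embed(labels, style):
        v = list(labels) + list(style)  # concatenate labels and style
        assert len(v) == in_dim
        return [sum(w * x for w, x in zip(row, v)) for row in weights]

    return embed

num_labels, style_dim, num_features = 2, 20, 8  # num_features is illustrative
de = make_domain_embedding(num_labels + style_dim, 2 * num_features)

rng = random.Random(1)
labels = [1.0, 0.0]
style_1 = [rng.gauss(0.0, 1.0) for _ in range(style_dim)]
style_2 = [rng.gauss(0.0, 1.0) for _ in range(style_dim)]
z1 = de(labels, style_1)
z2 = de(labels, style_2)  # same labels, new style, different AdaIN parameters

z_w, z_b = z1[:num_features], z1[num_features:]  # assumed scale / shift split
```

Since the map is fixed, the same label and style always yield the same $z$, while a freshly sampled style yields different normalization parameters for the same target label.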

Discriminator ( $\mathbb{D}$ )

As previously stated, the discriminator has two outputs: source domain (src) and auxiliary classifier (cls). First, we use the idea of patch-GAN [26] to tell whether the source is real or fake based on a patch rather than a single number ( $\mathbb{D}_{src}$ ). Second, we have a binary cross entropy loss function for the conditional labels ( $\mathbb{D}_{cls}$ ). If continuous labels are used, then a regression objective loss should be applied. However, as we will discuss in Section 5.2, our approach is capable of generating continuous labels even if it was trained with discrete ones.

3.2.1 Training Framework

In order to approximate function $\mathbb{G}$ in Equation 1, we split our general loss function for clarity.

Adversarial Loss

For the adversarial objective, we use the recently introduced averaged Relativistic Adversarial Loss (RGAN) [27] together with its hinge version [42]. RGAN relies on the idea that the discriminator not only estimates whether images are real or fake, but it also estimates the probability that the given real images are more realistic than the fake ones.

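$$\mathcal{L}_{D} = \mathbb{D}_{src}(\mathcal{X}_{r}) - \|\mathbb{D}_{src}(\mathcal{X}_{f})\|_{1}$$
$$\mathcal{L}_{G} = \mathbb{D}_{src}(\mathcal{X}_{f}) - \|\mathbb{D}_{src}(\mathcal{X}_{r})\|_{1}$$
$$\mathcal{L}_{adv} = \mathcal{L}_{D} + \mathcal{L}_{G}$$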

Conditional Loss

The adversarial loss does not include any regularization for the conditional labels, yet the generator must be able to produce both realistic and conditioned images. To solve this issue, we define the conditional loss as:

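$$\mathcal{L}_{cls} = \mathbb{D}_{cls}(\mathcal{X})\log(y) + (1 - \mathbb{D}_{cls}(\mathcal{X}))\log(1 - y)$$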

Recovery Loss

In order to produce $\mathcal{X}_{f}$ , we jointly input the target label and the target style. Therefore, the cycle consistency loss employed to recover the original image can be naively defined as:

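$$\mathcal{X}_{r} \approx \mathcal{X}_{rec} = \mathbb{G}(\mathbb{G}(\mathcal{X}_{r}, y_{f}, s_{f}), y_{r}, s_{r})$$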

Note that the original style ( $s_{r}$ ) is an unknown parameter. Nonetheless, we assume that $s_{r}$ is drawn from a known normal distribution, and therefore reformulate the reconstruction loss by adding a different random style $s_{f}^{\prime}$ . We assume random styles during the whole training process. Thus, we compute the reconstruction or cycle consistency loss as:

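$$\mathcal{X}_{rec} = \mathbb{G}(\mathbb{G}(\mathcal{X}_{r}, y_{f}, s_{f}), y_{r}, s_{f}^{\prime})$$
$$\mathcal{L}_{rec} = \|\mathcal{X}_{r} - \mathcal{X}_{rec}\|_{1}$$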

Attention Loss

Until this point, there is no guarantee that the output of our generator will preserve background details e.g., the underlying structure, or the identity of a person. To solve this particular issue, we regularize our model with the unsupervised attention mechanism proposed by Pumarola et al. [46]. We add a new and parallel layer to the generator output ( $\mathcal{X}_{f}$ ) that works as the attention mask ( $\mathcal{M}$ ).

The attention loss encourages fake images to change only certain regions with respect to the real input, and it is decomposed into the following terms:

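$$[\mathcal{X}_{f} \in \mathbb{R}^{H \times W \times 3},\ \mathcal{M} \in \mathbb{R}^{H \times W}] = \mathbb{G}(\mathcal{X}_{r}, y_{f}, s_{f})$$
$$\mathcal{X}_{f} = \mathcal{M} \cdot \mathcal{X}_{r} + (1 - \mathcal{M}) \cdot \mathcal{X}_{f}$$
$$\mathcal{L}_{attn} = \|\mathcal{M}\|_{1}$$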
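
The mask-based composition can be illustrated with a toy Python sketch on flat lists of pixel values. It assumes the blending convention in which a mask value of 1 keeps the real pixel and a value of 0 takes the generated one; averaging the L1 norm over pixels is our choice for readability.

```python
def blend(x_real, x_fake, mask):
    """Per-pixel blend: mask=1 keeps the real pixel, mask=0 takes the generated one."""
    return [m * r + (1.0 - m) * f for r, f, m in zip(x_real, x_fake, mask)]

def attention_loss(mask):
    """L1 norm of the mask, averaged over pixels (the averaging is an assumption)."""
    return sum(abs(m) for m in mask) / len(mask)

x_real = [0.2, 0.4, 0.6, 0.8]
x_fake = [1.0, 1.0, 1.0, 1.0]
mask = [1.0, 1.0, 0.0, 0.5]  # keep two pixels, replace one, mix one

out = blend(x_real, x_fake, mask)
loss_attn = attention_loss(mask)
print(out, loss_attn)
```

In this example the first two pixels are copied from the real image, the third is fully generated, and the fourth is an even mix of both.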

Identity Loss

To further stabilize the training framework, we regularize our model with the identity loss that is defined as follows:

$$\mathcal{L}_{idt} = \|\mathcal{X}_{r} - \mathbb{G}(\mathcal{X}_{r}, y_{r}, s_{f}^{\prime\prime})\|_{1}$$

Overall Loss

We define our full objective function in Equation 9 as the weighted sum of the previous losses:

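$$\mathcal{L} = \lambda_{adv}\mathcal{L}_{adv} + \lambda_{cls}\mathcal{L}_{cls} + \lambda_{rec}\mathcal{L}_{rec} + \lambda_{attn}\mathcal{L}_{attn} + \lambda_{idt}\mathcal{L}_{idt} \qquad (9)$$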
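
As a small sketch of how the five terms combine into one scalar objective, the Python snippet below computes a weighted sum over named losses. All numbers are placeholders: the lambda weights and the per-term loss values are illustrative and are not the values used in the paper.

```python
def total_loss(losses, weights):
    """Weighted sum of named loss terms; both dicts must share the same keys."""
    assert losses.keys() == weights.keys()
    return sum(weights[name] * losses[name] for name in losses)

# Placeholder values: these lambda weights are illustrative, not the paper's.
weights = {"adv": 1.0, "cls": 1.0, "rec": 10.0, "attn": 0.1, "idt": 1.0}
losses = {"adv": 0.5, "cls": 0.2, "rec": 0.03, "attn": 0.4, "idt": 0.05}

loss = total_loss(losses, weights)
print(loss)
```

Keeping the terms in a dictionary makes it easy to log each component separately and to switch one off by setting its weight to zero, e.g. for ablations.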

Remarkably, our method does not require style regularization [23, 34] since we use a training framework that can easily bypass it.

4 Experimental Setup

We validate our method on several very different datasets and tasks, such as instance facial synthesis [38], emotion recognition [31], Yosemite summer $\leftrightarrow$ winter [26], and edges-to-object generation [26].

In the supplementary material, we extend our qualitative results to painters [4], Alps seasons [4], RafD [31], BP4D [54], EmotionNet [16], and full CelebA [38] with 40 attributes.

4.1 Evaluation Metrics

Diverse Translation

The LPIPS metric [53] allows us to quantify the perceptual similarity between two different images. LPIPS computes the L2 distance between the deep features (e.g., from AlexNet or VGG) of a pair of images.
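The core of such a feature-space distance can be sketched as follows. This is a simplified illustration under stated assumptions: the features are handed in directly instead of being extracted by a pretrained network, and the learned per-channel weights of the actual metric are omitted.

```python
import math


def unit_normalize(vec, eps=1e-10):
    """Scale a channel vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def feature_distance(feats_a, feats_b):
    """feats_*: list of layers; each layer is a list of per-position
    channel vectors. Returns the squared L2 distance between normalized
    features, averaged over positions and summed over layers."""
    total = 0.0
    for layer_a, layer_b in zip(feats_a, feats_b):
        layer_sum = 0.0
        for va, vb in zip(layer_a, layer_b):
            na, nb = unit_normalize(va), unit_normalize(vb)
            layer_sum += sum((x - y) ** 2 for x, y in zip(na, nb))
        total += layer_sum / len(layer_a)  # spatial average
    return total
```

Identical feature maps give a distance of zero, and larger values indicate images that differ more in feature space, which is why the score can serve as a diversity measure over generated samples.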

Multi-label Translation

Besides the LPIPS score, we also compute the Inception Score (IS) [48], which is a popular score for I2I problems. The IS employs an Inception Network [50] to classify fake images and thus rank them according to their scores with respect to the prior distribution. Additionally, we report the Conditional Inception Score (CIS) [23], which quantifies both the quality and the diversity of the mapping.
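For reference, the standard Inception Score is the exponential of the average KL divergence between each per-image class posterior and the marginal class distribution. The sketch below computes it from a list of already predicted class probabilities; the classifier itself is assumed to exist and is not part of the example.

```python
import math


def inception_score(probs, eps=1e-12):
    """probs: one class-probability vector per generated image."""
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution over all generated images.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    kl_sum = 0.0
    for p in probs:
        kl_sum += sum(
            pj * (math.log(pj + eps) - math.log(marginal[j] + eps))
            for j, pj in enumerate(p)
            if pj > 0
        )
    return math.exp(kl_sum / n)
```

The score is 1 when every prediction equals the marginal, and it approaches the number of classes when each image is classified confidently and the classes are covered evenly.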

4.2 Evaluation Framework

Given the unique nature of our approach, we split the quantitative evaluation into two different schemes: multimodal evaluation and multi-label evaluation.

Multimodal Evaluation

We directly use MUNIT [23] and DRIT [34] to compare our method in GAN-based disentangled representations. For a fair comparison under this setting, we work on the same datasets, Edges [26] and Yosemite [55]. To this end, we train MUNIT and DRIT and report the corresponding LPIPS over the whole test set.

We use the LPIPS score to measure the diversity of the generated images. As there is no standard evaluation framework for diversity in GAN-based problems, we use a set of two metrics. First, as in MUNIT, we compute the one-vs-all diversity across the entire dataset (D), using the diversity of the real data as a reference. Then, we use a single fixed style to produce the cross-mapping and compute the diversity over the entire fake dataset. Second, as in DRIT, given a single image, we measure the partial diversity (PD) across different modalities (20 different styles), and report the average and standard deviation over all images in the set.
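The partial diversity protocol can be sketched as follows. The `translate` and `distance` callables are hypothetical placeholders for the trained generator and the LPIPS metric, and averaging over all pairs of the 20 outputs is an assumption about how the per-image score is aggregated.

```python
import itertools
import random
import statistics


def partial_diversity(images, translate, distance,
                      num_styles=20, style_dim=20, seed=0):
    """Mean and standard deviation, over images, of the average pairwise
    distance between translations of one image under different styles."""
    rng = random.Random(seed)
    per_image = []
    for image in images:
        styles = [[rng.gauss(0.0, 1.0) for _ in range(style_dim)]
                  for _ in range(num_styles)]
        outputs = [translate(image, s) for s in styles]
        pair_dists = [distance(a, b)
                      for a, b in itertools.combinations(outputs, 2)]
        per_image.append(statistics.mean(pair_dists))
    return statistics.mean(per_image), statistics.pstdev(per_image)
```

A generator that ignores the style yields a partial diversity of exactly zero under this measure, which is the signature of mode collapse.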

Multi-label Evaluation

Additionally, for purely multi-label I2I methods, we train an Inception network [50] on a RafD train set (90%) and report the IS and CIS over the remaining test set (10%). We retrain StarGAN and GANimation [46] under exactly the same settings in order to make a fair comparison.

4.3 Implementation Details

We use an ensemble of three different convolutional networks: Generator, Discriminator, and a Domain Embedding (DE).

Similar to previous methods [23, 34], we assume the style to be drawn from a prior Gaussian distribution with zero mean and identity covariance, namely $\mathcal{N}(0,I)$. Therefore, the DE takes this 20-dimensional style vector and the $N$-dimensional target domain (one-hot encoded) as inputs to produce the corresponding number of AdaIN parameters.
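The input and output shapes of the DE can be illustrated with the sketch below. A single fixed random linear map stands in for the actual DE network, whose architecture is described in the supplementary material; the function names and the split into one scale and one bias per channel are assumptions made for illustration.

```python
import random

STYLE_DIM = 20  # style dimensionality stated in the text


def make_fixed_embedding(num_domains, num_channels, seed=0):
    """Fixed random weights mapping [style, one-hot label] to 2 * C values."""
    rng = random.Random(seed)
    in_dim = STYLE_DIM + num_domains
    out_dim = 2 * num_channels
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)]
            for _ in range(out_dim)]


def domain_embedding(weights, style, label):
    """Return (scale, bias) lists, one value per generator channel."""
    inputs = list(style) + list(label)
    out = [sum(w * x for w, x in zip(row, inputs)) for row in weights]
    half = len(out) // 2
    return out[:half], out[half:]
```

With the weights held fixed, any change in the sampled style or in the target label changes the produced parameters, so both inputs keep a direct influence on the generator.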

We provide a more detailed description of the architecture of our networks and training details in the supplementary material.

5 Results

We quantitatively and qualitatively demonstrate the effectiveness of SMIT in several settings. First, we perform ablation experiments, then we show qualitative results over different datasets, and finally we perform an extensive quantitative evaluation and compare our results against the state-of-the-art.

5.1 Ablation Study

We establish different baselines that define the main components of our framework: DE learning, removing the style randomness, adding style regularization, and removing the attention mechanism. We perform a qualitative and quantitative comparison for each of them, and we report our findings in Figure 3 and Table 5.1, respectively.

DE learning

Studying the DE parameters is one of our main interests, as the DE is the only link between the style and labels on one side and the mapped image on the other. We observe that the generator easily falls into mode collapse if the DE weights are learned, producing almost the same image for different styles. To overcome this problem, we analyze the DE contribution to the overall system with either learned or fixed random parameters. As Figure 3 shows (SMIT$_{\text{DE\_learning}}$), learning the DE parameters leads to full mode collapse, since the style has a negligible impact on the AdaIN parameters of the generator. This behaviour arises because the gradients coming from the auxiliary classifier force the domain embedding to produce stable outputs, and hence the same output, given the lack of specialized per-domain style regularization. Conversely, fixing the DE weights guarantees diversity: from Equation 2, different scale and bias values yield different normalization behaviour and hence different outputs.
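The last point can be checked numerically with a small sketch. It assumes that Equation 2 is the usual adaptive instance normalization, in which each channel is normalized to zero mean and unit variance and then rescaled and shifted; a single channel is represented as a flat list of activations.

```python
import math


def adain(channel, scale, bias, eps=1e-5):
    """Normalize one channel, then apply the given scale and bias."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    std = math.sqrt(var + eps)
    return [scale * (v - mean) / std + bias for v in channel]


activations = [1.0, 2.0, 3.0, 4.0]
out_a = adain(activations, scale=1.0, bias=0.0)
out_b = adain(activations, scale=2.0, bias=1.0)
```

The two outputs share the same normalized shape but differ in spread and offset, so as long as distinct styles map to distinct scale and bias values, the generator receives distinct feature statistics.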

Bibliography

[1] Faceapp. http://www.faceapp.com. 2018.
[2] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. Van Gool. Extreme learned image compression with gans. In CVPR Workshops, 2018.
[3] A. Almahairi, S. Rajeswar, A. Sordoni, P. Bachman, and A. Courville. Augmented cyclegan: Learning many-to-many mappings from unpaired data. In ICML, 2018.
[4] A. Anoosheh, E. Agustsson, R. Timofte, and L. Van Gool. Combogan: Unrestrained scalability for image domain translation. In CVPR Workshops, 2018.
[5] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[6] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem. Finding tiny faces in the wild with generative adversarial network. In CVPR, 2018.
[7] A. Bansal, S. Ma, D. Ramanan, and Y. Sheikh. Recycle-gan: Unsupervised video retargeting. In ECCV, 2018.
[8] A. Bansal, Y. Sheikh, and D. Ramanan. Pixelnn: Example-based image synthesis. arXiv preprint arXiv:1708.05349, 2017.


Generator ( $\mathbb{G}$ )

Discriminator ( $\mathbb{D}$ )