Disentangling Latent Space for VAE by Label Relevant/Irrelevant Dimensions

Zhilin Zheng; Li Sun

arXiv:1812.09502 · cs.CV · March 18, 2019

TL;DR

This paper proposes a novel VAE method that disentangles latent space into label relevant and irrelevant parts, improving class-specific representation and avoiding posterior collapse.

Contribution

It introduces a disentangled latent space whose label relevant part follows a class-specific Gaussian mixture distribution, and shows theoretically that the method is equivalent to maximizing mutual information between that part and the class label.

Findings

1. Disentangled latent space improves class-specific representation.
2. The method is extendable to GANs for high-quality image synthesis.
3. Theoretical analysis shows that the method is equivalent to adding a KL divergence term on the joint distribution of the label relevant code and the class label.

Abstract

VAE requires the standard Gaussian distribution as a prior in the latent space. Since all codes tend to follow the same prior, it often suffers from the so-called "posterior collapse". To avoid this, this paper introduces a class-specific distribution for the latent code. Unlike cVAE, we present a method for disentangling the latent space into label relevant and irrelevant dimensions, $z_{s}$ and $z_{u}$, for a single input. We apply two separate encoders to map the input into $z_{s}$ and $z_{u}$ respectively, and then give the concatenated code to the decoder to reconstruct the input. The label irrelevant code $z_{u}$ represents the common characteristics of all inputs; hence it is constrained by the standard Gaussian, and its encoder is trained by amortized variational inference, as in VAE. While…

Tables

Table 1: Inception Score of different methods on two datasets. Please refer to 4.2 for more details.

Method	FaceScrub	CIFAR-10
cVAE [39]	9.55	3.01
cGAN [30]	10.02	6.27
cVAE-GAN [3]	16.75	6.99
ours	17.91	7.04

Table 2: Intra-class diversity of different methods on two datasets. Please refer to 4.2 for more details.

Method	FaceScrub	CIFAR-10
cVAE-GAN [3]	0.0141	0.0136
ours	0.0157	0.0149

Table 3: The network structure for CIFAR-10.

Encoder	Decoder	Discriminator
input $\bm{\mathrm{x}} \in \mathbb{R}^{32 \times 32 \times 3}$	input $\bm{\mathrm{z}}_{s} \in \mathbb{R}^{100}, \bm{\mathrm{z}}_{u} \in \mathbb{R}^{200}$	input $\bm{\mathrm{x}} \in \mathbb{R}^{32 \times 32 \times 3}$
$5 \times 5$ conv, 32, stride 2, batchnorm, relu	concat	$5 \times 5$ conv, 32, stride 1, lrelu
$5 \times 5$ conv, 64, stride 2, batchnorm, relu	fc, 1024, batchnorm, relu	$5 \times 5$ conv, 128, stride 2, lrelu
$3 \times 3$ conv, 128, stride 2, batchnorm, relu	$5 \times 5$ conv, 256, stride 2, batchnorm, relu	$5 \times 5$ conv, 256, stride 2, lrelu
$3 \times 3$ conv, 256, stride 2, batchnorm, relu	$5 \times 5$ conv, 256, stride 1, batchnorm, relu	$5 \times 5$ conv, 256, stride 2, lrelu
fc, 1024, batchnorm, relu	$5 \times 5$ conv, 128, stride 2, batchnorm, relu	fc, 512, lrelu
fc, 100 (for $\bm{\mathrm{z}}_{s}$) / 200 (for $\bm{\mathrm{z}}_{u}$)	$5 \times 5$ conv, 64, stride 2, batchnorm, relu	fc, 1
	$5 \times 5$ conv, 32, stride 2, batchnorm, relu
	$5 \times 5$ conv, 3, stride 1, tanh

Table 4: The network structure of discriminator for FaceScrub.

Discriminator for FaceScrub
input $\bm{\mathrm{x}} \in \mathbb{R}^{64 \times 64 \times 3}$
$3 \times 3$ conv, 64, stride 2, lrelu
$3 \times 3$ conv, 128, stride 2, lrelu
$3 \times 3$ conv, 256, stride 1, lrelu
$3 \times 3$ conv, 256, stride 2, lrelu
$3 \times 3$ conv, 512, stride 1, lrelu
$3 \times 3$ conv, 512, stride 2, lrelu
$3 \times 3$ conv, 512, stride 2, lrelu
global average pooling
fc, 1024, lrelu
fc, 1

Table 5: Inception Score of different methods.

Method	CUB-200-2011	CIFAR-100
cVAE [39]	37.34	3.10
cGAN [30]	78.06	6.39
cVAE-GAN [3]	91.14	6.68
ours	100.86	6.70

Table 6: Intra-class diversity of different methods.

Method	CUB-200-2011	CIFAR-100
cVAE-GAN [3]	0.0195	0.0179
ours	0.0192	0.0190

Equations

$$\log p(x) \geq E_{q_{\phi}(z|x)}\left[\log p_{\theta}(x|z)\right] - D_{KL}\left(q_{\phi}(z|x)\,||\,p(z)\right)$$

$$\log p(x) = \log \iint \sum_{c} p(x, z_{s}, z_{u}, c)\, dz_{s}\, dz_{u} \geq E_{q_{\psi}(z_{s}|x),\, q_{\phi}(z_{u}|x)}\left[\log p_{\theta}(x|z_{s}, z_{u})\right] - D_{KL}\left(q_{\phi}(z_{u}|x)\,||\,p(z_{u})\right) - D_{KL}\left(q_{\psi}(z_{s}, c|x)\,||\,p(z_{s}, c)\right)$$

$$L_{kl} = D_{KL}\left[\mathcal{N}(\mu, \Sigma)\,||\,\mathcal{N}(0, I)\right]$$

$$L_{C}^{adv} = -E_{q_{\phi}(z_{u}|x)} \sum_{c} \mathbb{I}(c = y) \log q_{\omega}(c|z_{u})$$

$$L_{E}^{adv} = -E_{q_{\phi}(z_{u}|x)} \sum_{c} \frac{1}{C} \log q_{\omega}(c|z_{u})$$

$$p(z_{s}) = \sum_{c} p(z_{s}|c)\, p(c) = \sum_{c} \mathcal{N}(z_{s}; \mu_{c}, \Sigma_{c})\, p(c)$$

$$L_{lkd} = -\log \mathcal{N}(\hat{z}_{s}; \mu_{y}, \Sigma_{y})$$

$$L_{cls} = -E_{q_{\psi}(z_{s}|x)} \sum_{c} \mathbb{I}(c = y) \log q(c|z_{s}) = -\log \frac{\mathcal{N}(\hat{z}_{s}|\mu_{y}, \Sigma_{y})\, p(y)}{\sum_{k} \mathcal{N}(\hat{z}_{s}|\mu_{k}, \Sigma_{k})\, p(k)}$$

$$L_{GM} = L_{cls} + \lambda_{lkd} L_{lkd}$$

$$L_{D}^{adv} = -E_{x \sim P_{r}}\left[\log D_{\theta_{d}}(x, c)\right] - E_{z_{u} \sim \mathcal{N}(0, I),\, z_{s} \sim p(z_{s})}\left[\log\left(1 - D_{\theta_{d}}(G(z_{s}, z_{u}), c)\right)\right]$$

$$L_{GD}^{adv} = -E_{z_{u} \sim \mathcal{N}(0, I),\, z_{s} \sim p(z_{s})}\left[\log\left(D_{\theta_{d}}(G(z_{s}, z_{u}), c)\right)\right]$$

$$\begin{aligned}
\mu_{t} &= \sum_{c} p(c)\, \mu_{c} \\
\sigma_{t}^{2} &= \sum_{c} p(c)\, \sigma_{c}^{2} + \sum_{c} p(c)\, (\mu_{c})^{2} - \Big(\sum_{c} p(c)\, \mu_{c}\Big)^{2}
\end{aligned}$$

$$d_{intra}(X) = 1 - \frac{1}{|X|^{2}} \sum_{(x', x) \in X \times X} \text{MS-SSIM}(x', x)$$

$$\log p(x) = \log \iint p(x, z_{s}, z_{u})\, dz_{s}\, dz_{u} \geq E_{q_{\psi}(z_{s}|x),\, q_{\phi}(z_{u}|x)}\left[\log p_{\theta}(x|z_{s}, z_{u})\right] - D_{KL}\left(q_{\phi}(z_{u}|x)\,||\,p(z_{u})\right) - D_{KL}\left(q_{\psi}(z_{s}, c|x)\,||\,p(z_{s}, c)\right)$$

$$p(x, z_{s}, z_{u}) = \sum_{c} p_{\theta}(x|z_{s}, z_{u})\, p(z_{s}, c)\, p(z_{u})$$

$$\begin{aligned}
\log p(x) &= \log \iint p(x, z_{s}, z_{u})\, dz_{s}\, dz_{u} \\
&= \log \iint \sum_{c} p_{\theta}(x|z_{s}, z_{u})\, p(z_{s}, c)\, p(z_{u})\, dz_{s}\, dz_{u} \\
&= \log E_{q_{\psi}(z_{s}, c|x),\, q_{\phi}(z_{u}|x)} \frac{p_{\theta}(x|z_{s}, z_{u})\, p(z_{u})\, p(z_{s}, c)}{q_{\psi}(z_{s}, c|x)\, q_{\phi}(z_{u}|x)} \\
&\geq E_{q_{\psi}(z_{s}, c|x),\, q_{\phi}(z_{u}|x)}\left[\log \frac{p_{\theta}(x|z_{s}, z_{u})\, p(z_{u})\, p(z_{s}, c)}{q_{\psi}(z_{s}, c|x)\, q_{\phi}(z_{u}|x)}\right] \\
&= E_{q_{\psi}(z_{s}, c|x),\, q_{\phi}(z_{u}|x)}\left[\log p_{\theta}(x|z_{s}, z_{u})\right] + E_{q_{\psi}(z_{s}, c|x),\, q_{\phi}(z_{u}|x)}\left[\log \frac{p(z_{u})}{q_{\phi}(z_{u}|x)}\right] + E_{q_{\psi}(z_{s}, c|x),\, q_{\phi}(z_{u}|x)}\left[\log \frac{p(z_{s}, c)}{q_{\psi}(z_{s}, c|x)}\right] \\
&= E_{q_{\psi}(z_{s}|x),\, q_{\phi}(z_{u}|x)}\left[\log p_{\theta}(x|z_{s}, z_{u})\right] - D_{KL}\left(q_{\phi}(z_{u}|x)\,||\,p(z_{u})\right) - D_{KL}\left(q_{\psi}(z_{s}, c|x)\,||\,p(z_{s}, c)\right)
\end{aligned}$$

$$q_{\psi}(z_{s}, c|x) = q_{\psi}(z_{s}|x)\, p(c|x)$$

$$q_{\psi}(z_{s}|x) = \delta(z_{s} - \hat{z}_{s})$$

$$\begin{aligned}
& D_{KL}\left(q_{\psi}(z_{s}, c|x)\,||\,p(z_{s}, c)\right) \\
&= D_{KL}\left[\delta(z_{s} - \hat{z}_{s})\, p(c|x)\,||\,p(z_{s}|c)\, p(c)\right] \\
&= -\sum_{c} \int \delta(z_{s} - \hat{z}_{s})\, p(c|x) \log \frac{p(z_{s}|c)\, p(c)}{\delta(z_{s} - \hat{z}_{s})\, p(c|x)}\, dz_{s} \\
&= -\sum_{c} \int \delta(z_{s} - \hat{z}_{s})\, p(c|x) \log p(z_{s}|c)\, dz_{s} - \sum_{c} \int \delta(z_{s} - \hat{z}_{s})\, p(c|x) \log \frac{p(c)}{p(c|x)}\, dz_{s} + \sum_{c} \int \delta(z_{s} - \hat{z}_{s})\, p(c|x) \log \delta(z_{s} - \hat{z}_{s})\, dz_{s} \\
&= -\sum_{c} p(c|x) \log p(\hat{z}_{s}|c) - \sum_{c} p(c|x) \log \frac{p(c)}{p(c|x)} + \int \delta(z_{s} - \hat{z}_{s}) \log \delta(z_{s} - \hat{z}_{s})\, dz_{s}
\end{aligned}$$

$$\begin{aligned}
& D_{KL}\left(q_{\psi}(z_{s}, c|x)\,||\,p(z_{s}, c)\right) \\
&= -\sum_{c} p(c|x) \log p(\hat{z}_{s}|c) + Const. \\
&= -\sum_{c} \mathbb{I}(c = y) \log p(\hat{z}_{s}|c) + Const.
\end{aligned}$$

$$L_{lkd} = -\sum_{c} \mathbb{I}(c = y) \log \mathcal{N}(\hat{z}_{s}; \mu_{c}, \Sigma_{c}) = -\log \mathcal{N}(\hat{z}_{s}; \mu_{y}, \Sigma_{y})$$

$$\begin{aligned}
I(z_{s}; c) &= H(c) - H(c|z_{s}) \\
&= H(c) + E_{p(x)} E_{q_{\psi}(z_{s}|x)} E_{p(c|z_{s})} \log p(c|z_{s}) \\
&= H(c) + E_{p(x)} E_{q_{\psi}(z_{s}|x)} E_{p(c|z_{s})} \log \frac{p(c|z_{s})}{q(c|z_{s})} + E_{p(x)} E_{q_{\psi}(z_{s}|x)} E_{p(c|z_{s})} \log q(c|z_{s}) \\
&\geq H(c) + E_{p(x)} E_{q_{\psi}(z_{s}|x)} E_{p(c|z_{s})} \log q(c|z_{s})
\end{aligned}$$

$$\begin{aligned}
& E_{p(x)} E_{q_{\psi}(z_{s}|x)} E_{p(c|z_{s})} \log q(c|z_{s}) \\
&= E_{p(c')} E_{p(x|c')} E_{q_{\psi}(z_{s}|x)} E_{p(c|z_{s})} \log q(c|z_{s}) \\
&= \sum_{c'} \sum_{c} \iint p(c')\, p(x|c')\, q_{\psi}(z_{s}|x)\, p(c|z_{s}) \log q(c|z_{s})\, dx\, dz_{s} \\
&= \sum_{c'} \sum_{c} \int p(c')\, p(c|z_{s}) \log q(c|z_{s}) \left[\int p(x|c')\, q_{\psi}(z_{s}|x)\, dx\right] dz_{s}
\end{aligned}$$

$$\int p(x|c')\, q_{\psi}(z_{s}|x)\, dx = \int p(z_{s}, x|c')\, dx = p(z_{s}|c')$$

$$\begin{aligned}
& \sum_{c'} \sum_{c} \int p(c')\, p(z_{s}|c')\, p(c|z_{s}) \log q(c|z_{s})\, dz_{s} \\
&= \sum_{c'} \int p(c')\, p(z_{s}|c') \log q(c'|z_{s})\, dz_{s}
\end{aligned}$$

$$\begin{aligned}
& \sum_{c'} \sum_{c} \int p(c')\, p(c|z_{s}) \log q(c|z_{s}) \left[\int p(x|c')\, q_{\psi}(z_{s}|x)\, dx\right] dz_{s} \\
&= \sum_{c'} \sum_{c} \int p(c')\, p(z_{s}|c')\, p(c|z_{s}) \log q(c|z_{s})\, dz_{s} \\
&= \sum_{c} \int p(c)\, p(z_{s}|c) \log q(c|z_{s})\, dz_{s}
\end{aligned}$$

$$\begin{aligned}
& \sum_{c} \int p(c)\, p(z_{s}|c) \log q(c|z_{s})\, dz_{s} \\
&= \sum_{c} \iint p(c)\, p(x|c)\, q_{\psi}(z_{s}|x) \log q(c|z_{s})\, dz_{s}\, dx \\
&= \sum_{c} \iint p(x)\, q_{\psi}(z_{s}|x)\, p(c|x) \log q(c|z_{s})\, dz_{s}\, dx \\
&= E_{p(x)} E_{q_{\psi}(z_{s}|x)} \sum_{c} p(c|x) \log q(c|z_{s})
\end{aligned}$$

$$L_{cls} = -E_{p(x)} E_{q_{\psi}(z_{s}|x)} \sum_{c} \mathbb{I}(c = y) \log q(c|z_{s})$$

Code & Models

Repositories

ZhilZheng/Lr-LiVAE
tf

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Advanced Image and Video Retrieval Techniques

Methods: Convolution

Full text

Disentangling Latent Space for VAE by Label Relevant/Irrelevant Dimensions

Zhilin Zheng¹, Li Sun¹

¹ Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University

Abstract

VAE requires the standard Gaussian distribution as a prior in the latent space. Since all codes tend to follow the same prior, it often suffers from the so-called "posterior collapse". To avoid this, this paper introduces a class-specific distribution for the latent code. Unlike cVAE, we present a method for disentangling the latent space into label relevant and irrelevant dimensions, $\bm{\mathrm{z}}_{s}$ and $\bm{\mathrm{z}}_{u}$, for a single input. We apply two separate encoders to map the input into $\bm{\mathrm{z}}_{s}$ and $\bm{\mathrm{z}}_{u}$ respectively, and then give the concatenated code to the decoder to reconstruct the input. The label irrelevant code $\bm{\mathrm{z}}_{u}$ represents the common characteristics of all inputs; hence it is constrained by the standard Gaussian, and its encoder is trained by amortized variational inference, as in VAE. In contrast, $\bm{\mathrm{z}}_{s}$ is assumed to follow a Gaussian mixture distribution in which each component corresponds to a particular class. The parameters of the Gaussian components in the $\bm{\mathrm{z}}_{s}$ encoder are optimized by label supervision in a global stochastic way. In theory, we show that our method is equivalent to adding a KL divergence term on the joint distribution of $\bm{\mathrm{z}}_{s}$ and the class label $c$, and that it directly increases the mutual information between $\bm{\mathrm{z}}_{s}$ and the label $c$. Our model can also be extended to GAN by adding a discriminator in the pixel domain so that it produces high quality and diverse images.

1 Introduction

Learning a deep generative model for structured image data is difficult because the task is not simply modeling a many-to-one mapping, as in classification; instead, the model is often required to generate diverse outputs for similar codes sampled from a simple distribution. Furthermore, an image $\bm{\mathrm{x}}$ in the high-dimensional space often lies on a complex manifold, so the generative model should capture the underlying data distribution $p(\bm{\mathrm{x}})$.

Variational Auto-Encoder (VAE) [34, 20] and Generative Adversarial Network (GAN) [13, 25] are the two basic strategies for structured data generation. In VAE, the encoder $q_{\phi}(\bm{\mathrm{z}}|\bm{\mathrm{x}})$ maps data $\bm{\mathrm{x}}$ into the code $\bm{\mathrm{z}}$ in latent space. The decoder, represented by $p_{\theta}(\bm{\mathrm{x}}|\bm{\mathrm{z}})$, is given a latent code $\bm{\mathrm{z}}$ sampled from a distribution specified by the encoder and tries to reconstruct $\bm{\mathrm{x}}$. The encoder and decoder in VAE are trained together, mainly based on the data reconstruction loss. At the same time, the distribution $q_{\phi}(\bm{\mathrm{z}}|\bm{\mathrm{x}})$ is regularized to be simple (e.g. Gaussian) through the Kullback-Leibler (KL) divergence between $q_{\phi}(\bm{\mathrm{z}}|\bm{\mathrm{x}})$ and $p(\bm{\mathrm{z}})=\mathcal{N}(0,\bm{\mathrm{I}})$, so that sampling in latent space is easy. Optimization of VAE is quite stable, but its results are blurry, mainly because the posterior defined by $q_{\phi}(\bm{\mathrm{z}}|\bm{\mathrm{x}})$ is not complex enough to capture the true posterior, a problem also known as "posterior collapse". On the other hand, GAN treats data generation as a min/max game between a generator $G(\bm{\mathrm{z}})$ and a discriminator $D(\bm{\mathrm{x}})$. The adversarial loss computed from the discriminator makes generated images more realistic, but training becomes more unstable. In [10, 22, 28], VAE and GAN are integrated so that they can benefit each other.

Both VAE and GAN work in an unsupervised way, without conditioning the generated image on a label. Conditional VAE (cVAE) [39, 3] extends VAE by providing the label $c$ to both the encoder and the decoder, so that it learns the data distribution conditioned on the given label. Hence, the encoder and decoder become $q_{\phi}(\bm{\mathrm{z}}|\bm{\mathrm{x}},c)$ and $p_{\theta}(\bm{\mathrm{x}}|\bm{\mathrm{z}},c)$. Similarly, in conditional GAN (cGAN) [9, 18, 33, 30] the label $c$ is given to both the generator $G(\bm{\mathrm{z}},c)$ and the discriminator $D(\bm{\mathrm{x}},c)$. Theoretically, feeding the label $c$ to the encoder in VAE, or to the decoder in VAE or GAN, helps increase the mutual information between the generated $\bm{\mathrm{x}}$ and the label $c$. Thus, it can improve the quality of generated images.

This paper deals with the image generation problem in VAE with two separate encoders. For a single input $\bm{\mathrm{x}}$, our goal is to disentangle the latent code $\bm{\mathrm{z}}$, computed by the encoders, into the label relevant dimensions $\bm{\mathrm{z}}_{s}$ and the irrelevant ones $\bm{\mathrm{z}}_{u}$. We emphasize the difference between $\bm{\mathrm{z}}_{s}$ and $\bm{\mathrm{z}}_{u}$, and between their corresponding encoders. Since the label $c$ is known during training, $\bm{\mathrm{z}}_{s}$ should be accurate and class-specific, while $\bm{\mathrm{z}}_{u}$, which has no label constraint, should be general. Specifically, the two encoders are constrained with different priors on their posterior distributions $q_{\phi_{s}}(\bm{\mathrm{z}}_{s}|\bm{\mathrm{x}})$ and $q_{\phi_{u}}(\bm{\mathrm{z}}_{u}|\bm{\mathrm{x}})$. Similar to VAE or cVAE, in which the full code $\bm{\mathrm{z}}$ is label irrelevant, the prior for $\bm{\mathrm{z}}_{u}$ is chosen as $\mathcal{N}(0,\bm{\mathrm{I}})$. Unlike previous works, however, the prior $p(\bm{\mathrm{z}}_{s})$ is made more complex in order to capture the label relevant distribution. The decoder takes the concatenation of $\bm{\mathrm{z}}_{s}$ and $\bm{\mathrm{z}}_{u}$ to reconstruct the input $\bm{\mathrm{x}}$. The distinction from cVAE and cGAN is that they use a fixed, one-hot encoded label, while our work uses $\bm{\mathrm{z}}_{s}$, which can be regarded as a variational, soft label.

Note that our model is trained in two stages. First, the encoder for $\bm{\mathrm{z}}_{s}$ is trained for the classification task under the supervision of label $c$. Instead of the softmax cross entropy loss, the Gaussian mixture cross entropy loss proposed in [44] is adopted, since it accumulates the mean $\bm{\mathrm{\mu}}_{c}$ and variance $\bm{\mathrm{\sigma}}_{c}$ for samples with the same label $c$ and models them as the Gaussian $\mathcal{N}(\bm{\mathrm{\mu}}_{c},\bm{\mathrm{\sigma}}_{c})$; hence $\bm{\mathrm{z}}_{s}\sim\mathcal{N}(\bm{\mathrm{\mu}}_{c},\bm{\mathrm{\sigma}}_{c})$. This first stage specifies the label relevant distribution. In the second stage, the two encoders and the decoder are trained jointly in an end-to-end manner based on the reconstruction loss, while the priors $\bm{\mathrm{z}}_{s}\sim\mathcal{N}(\bm{\mathrm{\mu}}_{c},\bm{\mathrm{\sigma}}_{c})$ and $\bm{\mathrm{z}}_{u}\sim\mathcal{N}(0,\bm{\mathrm{I}})$ are also enforced.
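
As a rough picture of what a per-class Gaussian over label relevant codes looks like, the NumPy sketch below computes the empirical mean and variance of codes that share a label. The function name `class_statistics` and the toy arrays are hypothetical, and the loss in [44] may update its Gaussian parameters differently (for example by gradient descent), so this is only an illustration of the quantities $\bm{\mathrm{\mu}}_{c}$ and $\bm{\mathrm{\sigma}}_{c}$, not the training procedure itself.

```python
import numpy as np

def class_statistics(codes, labels, num_classes):
    """Empirical per-class mean and (population) variance of latent codes.

    codes:  (n, d) array of label relevant codes (hypothetical encoder outputs)
    labels: (n,) integer array with values in [0, num_classes)
    Assumes every class has at least one sample.
    """
    means = np.stack([codes[labels == c].mean(axis=0) for c in range(num_classes)])
    variances = np.stack([codes[labels == c].var(axis=0) for c in range(num_classes)])
    return means, variances

# Toy example: four 2-D codes from two classes.
codes = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
labels = np.array([0, 0, 1, 1])
means, variances = class_statistics(codes, labels, num_classes=2)
```

Each row of `means` and `variances` then parameterizes one diagonal Gaussian component, one per class.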

The main contributions of this paper are as follows: (1) For a single input $\bm{\mathrm{x}}$ to the encoder, we provide an algorithm to disentangle the latent space into label relevant and irrelevant dimensions in VAE. Previous works such as [15, 4, 37] disentangle the latent space in AE rather than VAE, so variational inference is not possible with their models. Moreover, [27, 4, 23] require at least two inputs for training. (2) We find that the Gaussian mixture loss function is a suitable way to estimate the parameters of the prior distribution, and that it can be optimized in the VAE framework. (3) We give both a theoretical derivation and a variety of detailed experiments to explain the effectiveness of our work.

2 Related works

Two types of methods for structured image generation are VAE and GAN. VAE [20] is a parametric model defined by $p_{\theta}(\bm{\mathrm{x}}|\bm{\mathrm{z}})$ and $q_{\phi}(\bm{\mathrm{z}}|\bm{\mathrm{x}})$, which employs variational inference to maximize the evidence lower bound (ELBO), as shown in (1).

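$$\log p(\bm{\mathrm{x}}) \geq E_{q_{\phi}(\bm{\mathrm{z}}|\bm{\mathrm{x}})}\left(\log p_{\theta}(\bm{\mathrm{x}}|\bm{\mathrm{z}})\right) - D_{KL}\left(q_{\phi}(\bm{\mathrm{z}}|\bm{\mathrm{x}})\,||\,p(\bm{\mathrm{z}})\right) \tag{1}$$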

The right-hand side of the above is the ELBO, which is a lower bound on the data log-likelihood. In VAE, a differentiable encoder and decoder are connected; they are parameterized by $\phi$ and $\theta$, respectively. $E_{q_{\phi}(\bm{\mathrm{z}}|\bm{\mathrm{x}})}(\log p_{\theta}(\bm{\mathrm{x}}|\bm{\mathrm{z}}))$ corresponds to the end-to-end reconstruction loss, and $D_{KL}(q_{\phi}(\bm{\mathrm{z}}|\bm{\mathrm{x}})||p(\bm{\mathrm{z}}))$ is the KL divergence between the encoder's output distribution $q_{\phi}(\bm{\mathrm{z}}|\bm{\mathrm{x}})$ and the prior $p(\bm{\mathrm{z}})$, which is usually modeled by the standard normal distribution $\mathcal{N}(0,\bm{\mathrm{I}})$. Note that VAE assumes that the posterior $q_{\phi}(\bm{\mathrm{z}}|\bm{\mathrm{x}})$ is Gaussian, and that $\bm{\mathrm{\mu}}$ and $\bm{\mathrm{\sigma}}$ are estimated by the encoder for every single input $\bm{\mathrm{x}}$. This strategy is named amortized variational inference (AVI), and it is more efficient than stochastic variational inference (SVI) [17].
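
A minimal sketch of how a per-input Gaussian posterior is commonly sampled in practice, assuming the usual reparameterization $\bm{\mathrm{z}} = \bm{\mathrm{\mu}} + \bm{\mathrm{\sigma}} \odot \bm{\mathrm{\epsilon}}$ with $\bm{\mathrm{\epsilon}} \sim \mathcal{N}(0,\bm{\mathrm{I}})$ and a log-variance encoder output. Both are assumptions here, since the text above does not state the parameterization; `sample_latent` and the toy encoder outputs are hypothetical.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, diag(exp(log_var))) as z = mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)       # eps ~ N(0, I)
    return mu + np.exp(0.5 * log_var) * eps   # sigma = exp(log_var / 2)

rng = np.random.default_rng(0)
# Hypothetical encoder outputs for a batch of two inputs with a 2-D code.
mu = np.array([[0.5, -1.0], [2.0, 0.0]])
log_var = np.log(np.array([[0.01, 0.04], [1.0, 0.25]]))
z = sample_latent(mu, log_var, rng)
```

Because the noise is drawn independently of `mu` and `log_var`, the sample stays a differentiable function of the encoder outputs, which is what lets the encoder be trained from the reconstruction term.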

VAE's advantage is that its loss is easy to optimize, but the simple prior in latent space may not capture complex data patterns, which often leads to mode collapse in latent space. Moreover, VAE's code is hard to interpret. Thus, many works focus on improving VAE in these two respects. cVAE [39] adds the label vector as an input to both the encoder and the decoder, so that the latent code and the generated image are conditioned on the label, which potentially prevents latent collapse. On the other hand, $\beta$-VAE [16, 7] is an unsupervised approach to latent space disentanglement. It introduces a single hyper-parameter $\beta$ to balance the two loss terms in (1). A scheme named infinite mixture of VAEs is proposed and applied to semi-supervised generation in [1]. It uses multiple VAEs and combines them as a non-parametric mixture model. In [19], the semi-amortized VAE is proposed, which combines AVI with SVI. Here SVI estimates the distribution parameters on the whole training set, while AVI in the traditional VAE gives this estimate for a single input.

GAN [13] is another technique for modeling the data distribution $p_{D}(\bm{\mathrm{x}})$. It starts from a random $\bm{\mathrm{z}}\sim p(\bm{\mathrm{z}})$, where $p(\bm{\mathrm{z}})$ is simple, e.g. Gaussian, and trains a transform network $g_{\theta}(\bm{\mathrm{z}})$ with the help of a discriminator $D_{\phi}(\cdot)$ so that the distribution of $g_{\theta}(\bm{\mathrm{z}})$ approximates $p_{D}(\bm{\mathrm{x}})$. Later works [32, 26, 2, 14, 29] try to stabilize GAN training. The traditional GAN works in a fully unsupervised manner, while cGAN [18, 33, 30, 6] aims to generate images conditioned on labels. In cGAN, the label is given as an input to both the generator and the discriminator as a condition for the distribution. An encoder-decoder architecture such as AE or VAE can also be used in GAN. In ALI [11] and BiGAN [10], the encoder maps $\bm{\mathrm{x}}$ to $\bm{\mathrm{z}}$, while the decoder reverses it. The discriminator takes the pair of $\bm{\mathrm{z}}$ and $\bm{\mathrm{x}}$ and is trained adversarially to determine whether the pair comes from the encoder or the decoder. In VAE-GAN [22, 24], the data generated by VAE are improved by a discriminator. A similar idea is applied to cVAE in [3]. VAE-GAN is also used in some specific applications such as [4, 12].

Since the code $\bm{\mathrm{z}}$ potentially affects the generated data, some works try to model its effect and disentangle the dimensions of $\bm{\mathrm{z}}$. InfoGAN [9] reveals the effect of the latent code $c$ by maximizing the mutual information between $c$ and the synthetic data $g_{\theta}(\bm{\mathrm{z}},c)$. Its generator outputs $g_{\theta}(\bm{\mathrm{z}},c)$, which is inspected by the discriminator $D_{\phi}(\cdot)$; $D_{\phi}(\cdot)$ also tries to reconstruct the code $c$. In [27], the latent dimensions are disentangled in VAE into specified and unspecified factors, which is similar to our work. However, its encoder takes multiple inputs, and the decoder combines codes from different inputs for reconstruction. The work in [15] modifies [27] to take a single input. To stabilize training, its model is built on AE rather than VAE, hence it cannot perform variational inference. Other works [37, 4, 23] are also built on AE and require multiple inputs. Moreover, they apply only to a particular domain such as faces [37, 4] or image-to-image translation [23], while our work is built on VAE and takes only a single input for a more general case.

3 Proposed method

We propose an image generation algorithm based on VAE which divides the encoder into two separate ones, one encoding the label relevant representation $\bm{\mathrm{z}}_{s}$ and the other encoding the label irrelevant information $\bm{\mathrm{z}}_{u}$. $\bm{\mathrm{z}}_{s}$ is learned under the supervision of the categorical class label and is required to follow a Gaussian mixture distribution, while $\bm{\mathrm{z}}_{u}$ is expected to contain the remaining common information irrelevant to the label and is made close to the standard Gaussian $\mathcal{N}(\bm{0},\bm{\mathrm{I}})$.

3.1 Problem formulation

We are given a labeled dataset $\mathcal{D}_{s}=\{(\bm{\mathrm{x}}^{(1)},y^{(1)}),(\bm{\mathrm{x}}^{(2)},y^{(2)}),\cdots,(\bm{\mathrm{x}}^{(N)},y^{(N)})\}$, where $\bm{\mathrm{x}}^{(i)}$ is the $i$-th image and $y^{(i)}\in\{0,1,\cdots,C-1\}$ is the corresponding label. $C$ and $N$ are the number of classes and the size of the dataset, respectively. The goal of VAE is to maximize the ELBO defined in (1), so that the data log-likelihood $\log p(\bm{\mathrm{x}})$ is also maximized. The key idea is to split the full latent code $\bm{\mathrm{z}}$ into the label relevant dimensions $\bm{\mathrm{z}}_{s}$ and the irrelevant dimensions $\bm{\mathrm{z}}_{u}$, which means that $\bm{\mathrm{z}}_{s}$ fully reflects the class $c$ but $\bm{\mathrm{z}}_{u}$ does not. Thus the objective can be rewritten as follows (derived in detail in the Appendices).

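$$\log p(\bm{\mathrm{x}}) = \log \iint \sum_{c} p(\bm{\mathrm{x}}, \bm{\mathrm{z}}_{s}, \bm{\mathrm{z}}_{u}, c)\, d\bm{\mathrm{z}}_{s}\, d\bm{\mathrm{z}}_{u} \geq E_{q_{\psi}(\bm{\mathrm{z}}_{s}|\bm{\mathrm{x}}),\, q_{\phi}(\bm{\mathrm{z}}_{u}|\bm{\mathrm{x}})}\left[\log p_{\theta}(\bm{\mathrm{x}}|\bm{\mathrm{z}}_{s}, \bm{\mathrm{z}}_{u})\right] - D_{KL}\left(q_{\phi}(\bm{\mathrm{z}}_{u}|\bm{\mathrm{x}})\,||\,p(\bm{\mathrm{z}}_{u})\right) - D_{KL}\left(q_{\psi}(\bm{\mathrm{z}}_{s}, c|\bm{\mathrm{x}})\,||\,p(\bm{\mathrm{z}}_{s}, c)\right) \tag{2}$$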

In Eq. 2, the ELBO consists of three terms in our setting. The first term is the negative reconstruction error, where $p_{\theta}$ is the decoder parameterized by $\theta$. It measures whether the latent codes $\bm{\mathrm{z}}_{s}$ and $\bm{\mathrm{z}}_{u}$ are informative enough to recover the original data. In practice, the reconstruction error $L_{rec}$ can be defined as the $l_{2}$ loss between $\bm{\mathrm{x}}$ and its reconstruction $\bm{\mathrm{x}}^{\prime}$. The second term acts as a regularizer on the label irrelevant branch that pushes $q_{\phi}(\bm{\mathrm{z}}_{u}|\bm{\mathrm{x}})$ to match the prior distribution $p(\bm{\mathrm{z}}_{u})$, as described in detail in Section 3.2. The third term matches $q_{\psi}(\bm{\mathrm{z}}_{s}|\bm{\mathrm{x}})$ to a class-specific Gaussian distribution whose mean and covariance are learned with supervision, as introduced in Section 3.3.

3.2 Label irrelevant branch

Intuitively, we want to disentangle the latent code $\bm{\mathrm{z}}$ into $\bm{\mathrm{z}}_{s}$ and $\bm{\mathrm{z}}_{u}$, and expect $\bm{\mathrm{z}}_{u}$ to follow a fixed prior distribution that is irrelevant to the label. This regularization is realized by minimizing the KL divergence between $q_{\phi}(\bm{\mathrm{z}}_{u}|\bm{\mathrm{x}})$ and the prior $p(\bm{\mathrm{z}}_{u})$, as shown in Eq. 3. More specifically, $q_{\phi}(\bm{\mathrm{z}}_{u}|\bm{\mathrm{x}})$ is a Gaussian distribution whose mean $\bm{\mu}$ and diagonal covariance $\bm{\Sigma}$ are the outputs of $Encoder^{u}$, parameterized by $\phi$, and $p(\bm{\mathrm{z}}_{u})$ is simply set to $\mathcal{N}(\bm{0},\bm{\mathrm{I}})$. Hence the KL regularization term is:

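$$L_{kl} = D_{KL}\left[\mathcal{N}(\bm{\mu}, \bm{\Sigma})\,||\,\mathcal{N}(\bm{0}, \bm{\mathrm{I}})\right] \tag{3}$$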

Note that Eq. 3 has a closed form, which is easy to compute.
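
For a diagonal covariance, the closed form of this KL term is the standard expression $\frac{1}{2}\sum_{j}(\sigma_{j}^{2}+\mu_{j}^{2}-1-\log\sigma_{j}^{2})$. The sketch below evaluates it with NumPy; the function name `kl_to_standard_normal` and the log-variance parameterization are assumptions made for illustration, not details given in the text.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the last axis."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

# A posterior that already equals the prior has zero divergence.
kl_zero = kl_to_standard_normal(np.zeros(3), np.zeros(3))
# Shifting one mean by 1 with unit variance costs 0.5 nats.
kl_shift = kl_to_standard_normal(np.array([1.0, 0.0]), np.zeros(2))
```

The term grows both when the mean moves away from the origin and when the variance departs from one, which is how it pulls $q_{\phi}(\bm{\mathrm{z}}_{u}|\bm{\mathrm{x}})$ toward the prior.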

To ensure good disentanglement between $\bm{\mathrm{z}}_{u}$ and $\bm{\mathrm{z}}_{s}$, we introduce adversarial learning in the latent space, as in AAE [25], to drive the label relevant information out of $\bm{\mathrm{z}}_{u}$. To do this, an adversarial classifier is added on top of $\bm{\mathrm{z}}_{u}$ and trained to classify the category of $\bm{\mathrm{z}}_{u}$ with the cross entropy loss shown in (4):

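$$L_{C}^{adv} = -E_{q_{\phi}(\bm{\mathrm{z}}_{u}|\bm{\mathrm{x}})} \sum_{c} \mathbb{I}(c = y) \log q_{\omega}(c|\bm{\mathrm{z}}_{u}) \tag{4}$$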

where $\mathbb{I}(c=y)$ is the indicator function, and $q_{\omega}(c|\bm{\mathrm{z}}_{u})$ is the softmax probability output by the adversarial classifier parameterized by $\omega$. Meanwhile, $Encoder^{u}$ is trained to fool the classifier, so its target distribution is uniform over all categories, i.e. $\frac{1}{C}$ for each class. The corresponding cross entropy loss is defined in (5).

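$$L_{E}^{adv}=-\mathbb{E}_{q_{\phi}(\bm{\mathrm{z}}_{u}|\bm{\mathrm{x}})}\sum_{c}\frac{1}{C}\log q_{\omega}(c|\bm{\mathrm{z}}_{u})\tag{5}$$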
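The two cross-entropy losses described for the adversarial classifier and for $Encoder^{u}$ differ only in their target distribution: a one-hot label for the classifier, and the uniform distribution $\frac{1}{C}$ for the encoder. The sketch below illustrates this for a single sample, assuming the classifier outputs raw logits; the function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]

def classifier_adv_loss(logits, y):
    """Cross entropy against the one-hot label y (classifier side)."""
    return -log_softmax(logits)[y]

def encoder_adv_loss(logits):
    """Cross entropy against the uniform target 1/C (encoder side)."""
    log_p = log_softmax(logits)
    return -sum(log_p) / len(log_p)

# When the classifier is maximally confused, both losses equal log(C).
confused = [0.0, 0.0, 0.0]
loss_c = classifier_adv_loss(confused, y=1)
loss_e = encoder_adv_loss(confused)
```

The encoder-side loss is minimized when the classifier output is uniform, so minimizing it pushes label information out of $\bm{\mathrm{z}}_{u}$.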

3.3 Label relevant branch

Inspired by GM loss [44], we expect $\bm{\mathrm{z}}_{s}$ to follow a Gaussian mixture distribution, expressed in Eq. 6, where ${\bm{\mu}_{c}}$ and ${\bm{\Sigma}_{c}}$ are the mean and covariance of Gaussian distribution for class $c$ , and $p(c)$ is the prior probability, which is simply set to $\frac{1}{C}$ for all categories. For simplicity, we ignore the correlation among different dimensions of $\bm{\mathrm{z}}_{s}$ , hence ${\bm{\Sigma}_{c}}$ is assumed to be diagonal.

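$$p(\bm{\mathrm{z}}_{s})=\sum_{c}p(\bm{\mathrm{z}}_{s}|c)\,p(c)=\sum_{c}\mathcal{N}(\bm{\mathrm{z}}_{s};{\bm{\mu}_{c}},{\bm{\Sigma}_{c}})\,p(c)\tag{6}$$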

Recall that in Eq. 2, the KL divergence between $q_{\psi}(\bm{\mathrm{z}}_{s},c|\bm{\mathrm{x}})$ and $p(\bm{\mathrm{z}}_{s},c)$ is minimized. If the posterior over ${\bm{\mathrm{z}}_{s}}$ is modeled as a Gaussian distribution with covariance ${\bm{\Sigma}}\to{\bf 0}$ and mean $\hat{\bm{\mathrm{z}}}_{s}$ output by $Encoder^{s}$ , it is effectively a Dirac delta function $\delta(\bm{\mathrm{z}}_{s}-\hat{\bm{\mathrm{z}}}_{s})$ , and the KL divergence reduces to the likelihood regularization term $L_{lkd}$ in Eq. 7, as proved in Appendix A.2. Here $\bm{\mu}_{y}$ and $\bm{\Sigma}_{y}$ are the mean and covariance specified by the label $y$ .

$$L_{lkd}=-\log\mathcal{N}(\hat{\bm{\mathrm{z}}}_{s};{\bm{\mu}_{y}},{\bm{\Sigma}_{y}})\tag{7}$$

Furthermore, we want $\bm{\mathrm{z}}_{s}$ to contain as much label information as possible, so the mutual information between $\bm{\mathrm{z}}_{s}$ and class $c$ is added to the maximization objective. We prove in Appendix A.3 that this is equivalent to minimizing the cross-entropy loss between the posterior probability $q(c|\bm{\mathrm{z}}_{s})$ and the label, which is exactly the classification loss $L_{cls}$ in GM loss, as shown in Eq. 8.

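$$L_{cls}=-\mathbb{E}_{q_{\psi}(\bm{\mathrm{z}}_{s}|\bm{\mathrm{x}})}\sum_{c}\mathbb{I}(c=y)\log q(c|\bm{\mathrm{z}}_{s})\tag{8}$$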

These two terms are added up to form GM loss in Eq. 9. Here $L_{GM}$ is finally used to train the $Encoder^{s}$ .

$$L_{GM}=L_{cls}+\lambda_{lkd}L_{lkd}\tag{9}$$
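To make the two terms of the GM loss concrete, the sketch below evaluates them for one encoded point $\hat{\bm{\mathrm{z}}}_{s}$. It rests on several assumptions that are not confirmed by this section: the posterior $q(c|\bm{\mathrm{z}}_{s})$ is obtained by Bayes' rule from the class Gaussians with the uniform prior $p(c)=\frac{1}{C}$ (as in GM loss [44]), the covariances are diagonal, and the two terms are combined with the weight $\lambda_{lkd}$ reported in Section 4.2. All function names are hypothetical.

```python
import math

def log_gaussian(z, mu, var):
    """Log density of a diagonal Gaussian N(z; mu, diag(var))."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (zi - m) ** 2 / v
        for zi, m, v in zip(z, mu, var)
    )

def gm_loss(z_hat, y, mus, variances, lambda_lkd=0.1):
    """Return (L_GM, L_cls, L_lkd) for one code z_hat with label y."""
    num_classes = len(mus)
    # log of N(z_hat; mu_c, Sigma_c) * p(c), with the uniform prior p(c) = 1/C
    log_joint = [
        log_gaussian(z_hat, mus[c], variances[c]) - math.log(num_classes)
        for c in range(num_classes)
    ]
    m = max(log_joint)
    log_evidence = m + math.log(sum(math.exp(l - m) for l in log_joint))
    l_cls = -(log_joint[y] - log_evidence)                 # -log q(y | z_hat)
    l_lkd = -log_gaussian(z_hat, mus[y], variances[y])     # -log N(z_hat; mu_y, Sigma_y)
    return l_cls + lambda_lkd * l_lkd, l_cls, l_lkd

# Two well-separated classes in 2D; the code sits exactly on the mean of class 0.
mus = [[-2.0, 0.0], [2.0, 0.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
total, l_cls, l_lkd = gm_loss([-2.0, 0.0], 0, mus, variances)
```

In this toy setting the classification term is nearly zero because the point is far from the other class mean, while the likelihood term equals $\log 2\pi$, the negative log density at the mode of a 2D unit-variance Gaussian.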

3.4 The decoder and the adversarial discriminator

The latent codes $\bm{\mathrm{z}}_{s}$ and $\bm{\mathrm{z}}_{u}$ output by $Encoder^{s}$ and $Encoder^{u}$ are first concatenated together, and then further given to the decoder to reconstruct the input $\bm{\mathrm{x}}$ by $\bm{\mathrm{x}}^{\prime}$ . Here the $Decoder$ is indicated by $p_{\theta}(\bm{\mathrm{x}}|\bm{\mathrm{z}})$ with its parameter $\theta$ learned from the $l_{2}$ reconstruction error $L_{rec}$ . To synthesize a high quality $\bm{\mathrm{x}}^{\prime}$ , we also employ the adversarial training in the pixel domain. Specifically, a discriminator $D_{\theta_{d}}(\bm{\mathrm{x}},c)$ with adversarial training on its parameter $\theta_{d}$ is used to improve $\bm{\mathrm{x}}^{\prime}$ . Here the label $c$ is utilized in $D_{\theta_{d}}$ like in [30]. The adversarial training loss for discriminator can be formulated as in Eq. 10,

[TABLE]

while this loss becomes

[TABLE]

for the generator. Note that here $G(\bm{\mathrm{z}}_{s},\bm{\mathrm{z}}_{u})$ is the decoder and $p(\bm{\mathrm{z}}_{s})$ is defined in Eq. 6.

3.5 Training algorithm

The training detail is illustrated in Algorithm 1. The $Encoder^{s}$ , modeled by $q_{\psi}$ , extracts label relevant code $\bm{\mathrm{z}}_{s}$ . $Encoder^{s}$ is trained with $L_{GM}$ and $L_{rec}$ , encouraging $\bm{\mathrm{z}}_{s}$ to be label dependent and follow a learned Gaussian mixture distribution. Meanwhile, the $Encoder^{u}$ represented by $q_{\phi}$ is intended to extract class irrelevant code $\bm{\mathrm{z}}_{u}$ . It’s trained by $L_{kl}$ , $L_{E}^{adv}$ and $L_{rec}$ to make ${\bm{\mathrm{z}}_{u}}$ irrelevant to the label and be close to $\mathcal{N}({\bf 0},{\bf I})$ . The adversarial classifier parameterized by $\omega$ is learned to classify ${\bm{\mathrm{z}}_{u}}$ using $L_{C}^{adv}$ . Then the decoder $p_{\theta}$ generates reconstruction image using the combined feature of $\bm{\mathrm{z}}_{s}$ and ${\bm{\mathrm{z}}_{u}}$ with the loss $L_{rec}$ .

In the training process, a 2-stage alternating training algorithm is adopted. First, $Encoder^{s}$ is updated using $L_{GM}$ to learn mean ${\bm{\mu}_{c}}$ and covariance ${\bm{\Sigma}_{c}}$ of the prior $p(\bm{\mathrm{z}}_{s}|c)$ . Then, the two encoders and the decoder are trained jointly to reconstruct images while the distributions of $\bm{\mathrm{z}}_{s}$ and ${\bm{\mathrm{z}}_{u}}$ are considered.
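The alternation between the two stages can be sketched as a simple schedule. This is only an illustration of the ordering: the default of three first-stage updates per second-stage update is taken from Appendix B.4, the step descriptions are placeholders, and the updates of the adversarial classifier and the discriminator are omitted because Algorithm 1 is not reproduced here.

```python
def alternating_schedule(num_iterations, gm_steps=3):
    """Yield (iteration, step description) pairs for the 2-stage training.

    gm_steps is the number of stage-1 updates per stage-2 update
    (3 according to Appendix B.4).
    """
    for it in range(num_iterations):
        for _ in range(gm_steps):
            yield it, "stage 1: update Encoder^s with L_GM"
        yield it, "stage 2: update both encoders and the decoder jointly"

steps = list(alternating_schedule(2))
for it, description in steps:
    print(it, description)
```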

3.6 Application in semi-supervised generation

Given $L$ extra unlabeled samples $\mathcal{D}_{u}=\{\bm{\mathrm{x}}^{(N+1)},\bm{\mathrm{x}}^{(N+2)},\cdots,\bm{\mathrm{x}}^{(N+L)}\}$ , we now use our architecture for semi-supervised generation, in which the labels $y^{(N+i)}$ of $\bm{\mathrm{x}}^{(N+i)}$ in $\mathcal{D}_{u}$ are not available. We assume that $\mathcal{D}_{u}$ is in the same domain as the fully supervised $\mathcal{D}_{s}$ , but $y^{(N+i)}$ may either satisfy $y^{(N+i)}\in\{0,1,\cdots,C-1\}$ or fall outside the predefined range. In other words, if the absent $y^{(N+i)}$ is in the predefined range, its $\bm{\mathrm{z}}_{s}$ follows the same Gaussian mixture distribution as in Eq. 6. Otherwise, $\bm{\mathrm{z}}_{s}$ should follow an ambiguous Gaussian distribution defined in Eq. 11.

[TABLE]

More specifically, $\bm{\mathrm{z}}_{s}$ is expected to follow $\mathcal{N}({\bm{\mu}_{t}},{\bm{\Sigma}_{t}})$ where ${\bm{\mu}_{t}}$ and ${\bm{\Sigma}_{t}}$ are the total mean and covariance of all the class-specific Gaussian distributions $\mathcal{N}({\bm{\mu}_{c}},{\bm{\Sigma}_{c}})$ as illustrated in Eq. 6. Here, ${\bm{\Sigma}_{t}}$ is a diagonal matrix with ${\bm{\sigma}_{t}^{2}}$ as its variance vector. Likewise, ${\bm{\sigma}_{c}^{2}}$ is the variance vector of ${\bm{\Sigma}_{c}}$ . Hence, the likelihood regularization term becomes $L_{lkd}=-\log\mathcal{N}(\hat{\bm{\mathrm{z}}_{s}};{\bm{\mu}_{t}},{\bm{\Sigma}_{t}})$ . The whole network is trained in an end-to-end manner using the total loss. Note that in this setting, the label $y$ is not provided, so $L_{GM}$ , $L^{adv}_{E}$ and $L^{adv}_{C}$ are ignored in the training process.
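One common way to obtain a "total" mean and variance from the class-specific Gaussians is moment matching of the equal-weight mixture. The sketch below uses that definition as an assumption; the exact form of Eq. 11 may differ, and the function name `total_gaussian` is hypothetical.

```python
def total_gaussian(mus, variances):
    """Moment-match an equal-weight mixture of diagonal Gaussians.

    mus, variances: lists of per-class mean and variance vectors.
    Returns (mu_t, var_t), using per dimension
      mu_t  = mean_c(mu_c)
      var_t = mean_c(var_c + mu_c^2) - mu_t^2
    """
    num_classes = len(mus)
    dim = len(mus[0])
    mu_t = [sum(mus[c][d] for c in range(num_classes)) / num_classes
            for d in range(dim)]
    var_t = [
        sum(variances[c][d] + mus[c][d] ** 2 for c in range(num_classes)) / num_classes
        - mu_t[d] ** 2
        for d in range(dim)
    ]
    return mu_t, var_t

# Two 1D classes centered at -1 and +1 with unit variance.
mu_t, var_t = total_gaussian([[-1.0], [1.0]], [[1.0], [1.0]])
```

Under this assumption the total variance is the average within-class variance plus the spread of the class means, so the resulting Gaussian covers the region spanned by all classes.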

4 Experiments

In this section, experiments are carried out to validate the effectiveness of the proposed method. A toy example is first designed to show that, by disentangling the label relevant and irrelevant codes, our model can generate more diverse data samples than cVAE-GAN [3]. We then compare the quality of generated images on real image datasets. The latent space is also analyzed. Finally, the experiments on semi-supervised generation and image inpainting show the flexibility of our model, suggesting that it may have many potential applications.

4.1 Toy examples

This section demonstrates our method on a toy example, in which the real data distribution lies in 2D with one dimension ( $x$ axis) being label relevant and the other ( $y$ axis) being irrelevant. The distribution is assumed to be known. There are 3 types of data points indicated by green, red and blue, belonging to 3 classes. The 2D data points and their corresponding labels are given to our model for variational inference and the new sample generation.

For comparison, we also give the same training data to cVAE-GAN for the same purpose. The two models use similar network settings. In our model, both encoders are MLPs with 3 hidden layers of 32, 64, and 64 units. cVAE-GAN uses the same encoder architecture but has only one encoder. The discriminators are identical: each is also an MLP with 3 hidden layers of 32, 64, and 64 units. Adam is used as the optimizer with a fixed learning rate of 0.0005 for both models. Each model is trained for 50 epochs, by which point both have converged. The generated samples of each model are plotted in Figure 2.

From Figure 2 we can observe that both models capture the underlying data distribution, and our model converges at a similar rate. The advantage of our model is that it tends to generate diverse samples, while cVAE-GAN generates samples conservatively, with the label irrelevant dimension confined to a limited value range.

4.2 Analysis on generated image quality

In this section, we compare our method with other generative models for image generation quality. The experiments are conducted on two datasets: FaceScrub [31] and CIFAR-10 [21]. FaceScrub contains $92k$ training images from $530$ different identities. For FaceScrub, a cascaded object detector proposed in [42] is first used to detect faces, and face alignment is then conducted based on SDM proposed in [46]. The detected and cropped faces are resized to a fixed size of 64 $\times$ 64. In the training process, the Adam optimizer with $\alpha=0.0005$ is used. The hyperparameters $\lambda_{lkd}$ , $\lambda_{kl}$ , and $\lambda_{rec}$ are set to 0.1, $\frac{10}{N_{pixel}}$ , and $\frac{1}{N_{\bm{\mathrm{z}}_{u}}}$ , respectively. Here, $N_{pixel}$ is the number of image pixels, and $N_{\bm{\mathrm{z}}_{u}}$ is the dimension of ${\bm{\mathrm{z}}_{u}}$ . Since our method incorporates the label for training, popular generative networks conditioned on the label, like cVAE [39], cVAE-GAN [3], and cGAN [30], are chosen for comparison. For cVAE, cVAE-GAN and cGAN, we randomly generate samples of class $c$ by first sampling ${\bf z}\sim\mathcal{N}({\bf 0},{\bf I})$ and then concatenating ${\bf z}$ and the one-hot vector of $c$ as the input of the decoder/generator. As for ours, $\bm{\mathrm{z}}_{s}\sim\mathcal{N}(\bm{\mathrm{\mu}}_{c},\bm{\mathrm{\sigma}}_{c})$ and $\bm{\mathrm{z}}_{u}\sim\mathcal{N}(\bm{\mathrm{0}},\bm{\mathrm{I}})$ are sampled and combined for the decoder to generate samples. Some of the generated images are visualized in Figure 9. It shows that samples generated by cVAE are highly blurred, and cGAN suffers from mode collapse. Samples generated by cVAE-GAN and our method appear to have similar quality, so we refer to two metrics, $Inception\ Score$ [36] and intra-class diversity [5], to compare them.

We adopt $Inception\ Score$ to evaluate realism and inter-class diversity of images. Generated images that are close to real images of class $y$ should have a posterior probability $p(y|\bm{\mathrm{x}})$ with low entropy. Meanwhile, images of diverse classes should have a marginal probability $p(y)$ with high entropy. Hence, $Inception\ Score$ , formulated as $\exp(\mathbb{E}_{\bm{\mathrm{x}}}KL(p(y|\bm{\mathrm{x}})||p(y)))$ , gets a high value when images are realistic and diverse.

To get the conditional class probability $p(y|\bm{\mathrm{x}})$ , we first train a classifier with the Inception-ResNet-v1 [40] architecture on real data. Then we randomly generate 53k samples (100 for each class) of FaceScrub and 5k samples (500 for each class) of CIFAR-10, and feed them to the pre-trained classifier. The marginal $p(y)$ is obtained by averaging all $p(y|\bm{\mathrm{x}})$ . The results are listed in Table 1.
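The score $\exp(\mathbb{E}_{\bm{\mathrm{x}}}KL(p(y|\bm{\mathrm{x}})||p(y)))$ can be computed directly from the classifier outputs. The following sketch assumes the predictions are already available as rows of class probabilities, one row per generated image; the function name `inception_score` and the toy inputs are illustrative only.

```python
import math

def inception_score(probs):
    """exp( mean_x KL( p(y|x) || p(y) ) ), with p(y) the average of all p(y|x).

    probs: list of rows, each row a probability vector over classes.
    """
    n = len(probs)
    num_classes = len(probs[0])
    p_y = [sum(row[c] for row in probs) / n for c in range(num_classes)]
    mean_kl = sum(
        sum(pc * math.log(pc / p_y[c]) for c, pc in enumerate(row) if pc > 0.0)
        for row in probs
    ) / n
    return math.exp(mean_kl)

# Uninformative predictions give the minimum score of 1.
score_flat = inception_score([[0.5, 0.5], [0.5, 0.5]])
# Confident predictions spread evenly over C classes give the maximum score C.
score_sharp = inception_score([[1.0, 0.0], [0.0, 1.0]])
```

The two toy cases bracket the range of the score: it is 1 when every image receives the same prediction and equals the number of classes when predictions are confident and evenly spread across classes.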

We emphasize that our method generates more diverse samples within a class. Since $Inception\ Score$ only measures inter-class diversity, intra-class diversity of samples should also be taken into account. We adopt the metric proposed in [5], which measures the average negative MS-SSIM [45] between all pairs in the generated image set $\bm{X}$ . Table 2 shows the intra-class diversity of cVAE-GAN and our method on FaceScrub and CIFAR-10.

[TABLE]

4.3 Analysis on disentangled latent space

We now evaluate our proposal on the disentangled latent space, which is represented by label relevant dimensions $\bm{\mathrm{z}}_{s}$ and irrelevant ones $\bm{\mathrm{z}}_{u}$ . $\bm{\mathrm{z}}_{s}$ for class $c$ is supposed to capture the variation unique to training images within the label $c$ , while $\bm{\mathrm{z}}_{u}$ should contain the variation in common characteristics for all classes. This is validated in two ways: (1) fixing $\bm{\mathrm{z}}_{u}$ and varying $\bm{\mathrm{z}}_{s}$ . In this setting, we directly sample a $\bm{\mathrm{z}}_{u}\sim\mathcal{N}(\bm{\mathrm{0}},\bm{\mathrm{I}})$ , and keep it fixed. Then a set of $\bm{\mathrm{z}}_{s}$ for class $c$ is obtained by first getting a series of random codes sampled from $\mathcal{N}(\bm{\mathrm{0}},\bm{\mathrm{I}})$ and then mapping them to class $c$ . Specifically, we first sample $\bm{\mathrm{z}}_{1}\sim\mathcal{N}(\bm{\mathrm{0}},\bm{\mathrm{I}})$ and $\bm{\mathrm{z}}_{2}\sim\mathcal{N}(\bm{\mathrm{0}},\bm{\mathrm{I}})$ . Then a set of random codes $\bm{\mathrm{z}}^{(i)}$ is obtained by linear interpolation, i.e., $\bm{\mathrm{z}}^{(i)}=\alpha\bm{\mathrm{z}}_{1}+(1-\alpha)\bm{\mathrm{z}}_{2},\alpha\in[0,1]$ . We map each $\bm{\mathrm{z}}^{(i)}$ to class $c$ with $\bm{\mathrm{z}}_{s}^{(i)}=\bm{\mathrm{z}}^{(i)}\odot\bm{\sigma}_{c}+\bm{\mu}_{c}$ . Finally each $\bm{\mathrm{z}}_{s}^{(i)}$ is concatenated with the fixed $\bm{\mathrm{z}}_{u}$ and given to the decoder to get a generated image. (2) fixing $\bm{\mathrm{z}}_{s}$ and varying $\bm{\mathrm{z}}_{u}$ . Similar to (1), we first sample a $\bm{\mathrm{z}}_{s}\sim\mathcal{N}(\bm{{\mu}_{c}},\bm{\sigma}_{c})$ from a learned distribution and keep it fixed. Then a set of label irrelevant $\bm{\mathrm{z}}_{u}$ is obtained by linearly interpolating between $\bm{\mathrm{z}}_{1}$ and $\bm{\mathrm{z}}_{2}$ , where $\bm{\mathrm{z}}_{1}$ and $\bm{\mathrm{z}}_{2}$ are sampled from $\mathcal{N}(\bm{\mathrm{0}},\bm{\mathrm{I}})$ .
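The interpolation and class mapping used in setting (1) can be sketched as follows. The helper names, the number of interpolation steps, and the example class statistics are assumptions made for illustration; only the two formulas come from the text.

```python
def interpolate(z1, z2, steps):
    """Codes z = alpha * z1 + (1 - alpha) * z2 for evenly spaced alpha in [0, 1]."""
    alphas = [i / (steps - 1) for i in range(steps)]
    return [[a * u + (1.0 - a) * v for u, v in zip(z1, z2)] for a in alphas]

def map_to_class(z, mu_c, sigma_c):
    """Map a standard-normal code to class c: z_s = z * sigma_c + mu_c (element-wise)."""
    return [zi * s + m for zi, s, m in zip(z, sigma_c, mu_c)]

# Hypothetical 2D endpoints and class statistics.
codes = interpolate([1.0, 0.0], [0.0, 1.0], steps=3)
z_s_codes = [map_to_class(z, mu_c=[2.0, 2.0], sigma_c=[0.5, 0.5]) for z in codes]
```

Because the same interpolated codes are reused for every class and only $\bm{\mu}_{c}$ and $\bm{\sigma}_{c}$ change, columns of the resulting image grid are directly comparable across classes.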

We conduct experiments on FaceScrub and the generated images are shown in Figure 4. In Figure 4 (a), each row presents samples generated by linearly transformed $\bm{\mathrm{z}}_{s}$ of a certain class $c$ and a fixed $\bm{\mathrm{z}}_{u}$ . All three rows share the same $\bm{\mathrm{z}}_{u}$ , and each column shares the same random code $\bm{\mathrm{z}}^{(i)}$ , which is mapped to a different class $c$ with $\bm{\mathrm{z}}_{s}^{(i)}=\bm{\mathrm{z}}^{(i)}\odot\bm{\sigma}_{c}+\bm{\mu}_{c}$ . It shows that as $\bm{\mathrm{z}}_{s}$ varies, the appearance changes differently for different identities, e.g., growing a beard, gaining wrinkles, or removing make-up. In Figure 4 (b), each row presents samples with linearly transformed $\bm{\mathrm{z}}_{u}$ and a fixed $\bm{\mathrm{z}}_{s}$ of class $c$ , and each column shares the same $\bm{\mathrm{z}}_{u}$ . We can see that images in each row change consistently in pose, expression and illumination. These two experiments suggest that $\bm{\mathrm{z}}_{s}$ is relevant to $c$ , while $\bm{\mathrm{z}}_{u}$ reflects more common, label irrelevant characteristics.

We are also interested in each dimension of $\bm{\mathrm{z}}_{u}$ and conduct an experiment by varying a single element in it. We find three dimensions in $\bm{\mathrm{z}}_{u}$ that reflect meaningful common characteristics, such as expression, elevation and azimuth.

4.4 Semi-supervised image generation

Following the details in Section 3.6, we conduct experiments on semi-supervised image generation. We find that our method can learn a well disentangled latent representation when extra unlabeled data are available. To validate this, we randomly select 200 identities with about 21k images from the CASIA [47] dataset and remove their labels to form the unlabeled dataset $\mathcal{D}_{u}$ . Note that the identities in $\mathcal{D}_{u}$ are entirely different from those in FaceScrub. After training the whole network on the labeled dataset $\mathcal{D}_{s}$ , we finetune it on $\mathcal{D}_{u}$ using the training algorithm described in Section 3.6.

To demonstrate the semi-supervised generation results, two different images are given to $Encoder^{s}$ and $Encoder^{u}$ to generate the codes $\bm{\mathrm{z}}_{s}$ and $\bm{\mathrm{z}}_{u}$ , respectively. Then, the decoder is required to synthesize a new image based on the concatenation of $\bm{\mathrm{z}}_{s}$ and $\bm{\mathrm{z}}_{u}$ . Figure 6 shows face synthesis results using images whose identities have not appeared in $\mathcal{D}_{s}$ . The first row and first column show a set of original images providing $\bm{\mathrm{z}}_{u}$ and $\bm{\mathrm{z}}_{s}$ respectively, while images in the middle are generated using $\bm{\mathrm{z}}_{s}$ of the corresponding row and $\bm{\mathrm{z}}_{u}$ of the corresponding column. The identity clearly depends on $\bm{\mathrm{z}}_{s}$ , while other characteristics such as pose, illumination and expression are reflected in $\bm{\mathrm{z}}_{u}$ . This semi-supervised generation shows that $\bm{\mathrm{z}}_{s}$ and $\bm{\mathrm{z}}_{u}$ can also be disentangled on identities outside the labeled training data $\mathcal{D}_{s}$ , which provides great flexibility for image generation.

4.5 Image inpainting

Our method can also be applied to image inpainting: given a partly corrupted image, we can extract a meaningful latent code to reconstruct the original image. Note that in cVAE-GAN [3], an extra class label $c$ must be provided for reconstruction, whereas our method does not need it. In practice, we first corrupt some patches of an image $\bm{x}$ , namely the right-half, eyes, nose and mouth, and bottom-half regions. We then input the corrupted images into the two encoders to get $\bm{\mathrm{z}}_{s}$ and $\bm{\mathrm{z}}_{u}$ , and the reconstructed image $\bm{x^{\prime}}$ is generated from the combined $\bm{\mathrm{z}}_{s}$ and $\bm{\mathrm{z}}_{u}$ . The image inpainting result is obtained by $\bm{x}^{inp}=\bm{M}\odot\bm{x^{\prime}}+(1-\bm{M})\odot\bm{x}$ , where $\bm{M}$ is the binary mask for the corrupted patch. Figure 7 shows the results of image inpainting. cVAE-GAN struggles to complete the images when a large region is missing (e.g. right-half and bottom-half parts) or when pivotal regions of faces (e.g. eyes) are missing, while our method provides visually pleasing results.
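The composition step $\bm{x}^{inp}=\bm{M}\odot\bm{x^{\prime}}+(1-\bm{M})\odot\bm{x}$ keeps the known pixels from the input and takes only the masked pixels from the reconstruction. A minimal sketch on a tiny grayscale image stored as nested lists is given below; the function name and the pixel values are made up for illustration.

```python
def inpaint(x, x_rec, mask):
    """Blend per pixel: mask * x_rec + (1 - mask) * x.

    x: corrupted input image, x_rec: decoder reconstruction,
    mask: 1 where the pixel is corrupted, 0 where it is known.
    All three are 2D lists of the same shape.
    """
    return [
        [m * r + (1 - m) * o for o, r, m in zip(x_row, rec_row, mask_row)]
        for x_row, rec_row, mask_row in zip(x, x_rec, mask)
    ]

# 2x2 example: the right column is corrupted (mask = 1).
x = [[0.2, 0.0], [0.4, 0.0]]
x_rec = [[0.3, 0.7], [0.5, 0.9]]
mask = [[0, 1], [0, 1]]
x_inp = inpaint(x, x_rec, mask)
```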

5 Conclusion

We propose a latent space disentangling algorithm on a VAE baseline. Our model learns two separate encoders and divides the latent code into label relevant and irrelevant dimensions. Together with a discriminator in the pixel domain, we show that our model can generate high quality and diverse images, and that it can also be applied to semi-supervised image generation, in which unlabeled data with unseen classes are given to the encoders. Future research includes building more interpretable latent dimensions with the help of more labels, and reducing the correlation between the label relevant and irrelevant codes in our framework.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Project 61302125, and in part by Natural Science Foundation of Shanghai under Project 17ZR1408500. Corresponding to [email protected]

Appendix A Mathematical proofs

A.1 The ELBO of the log-likelihood objective

We declare in Equation 2 that after dividing the latent space into label relevant dimensions $\bm{\mathrm{z}}_{s}$ and label irrelevant dimensions $\bm{\mathrm{z}}_{u}$ , the ELBO of the log-likelihood objective $\log p(\bm{\mathrm{x}})$ becomes 3 terms in our setting.

[TABLE]

Proof

Our generative process is described as follows. First, sample a label relevant code $\bm{\mathrm{z}}_{s}\sim p(\bm{\mathrm{z}}_{s},c)$ and a label irrelevant code $\bm{\mathrm{z}}_{u}\sim p(\bm{\mathrm{z}}_{u})$ . Then, a decoder $p_{\theta}(\bm{\mathrm{x}}|\bm{\mathrm{z}}_{s},\bm{\mathrm{z}}_{u})$ , taking the combination of $\bm{\mathrm{z}}_{s}$ and $\bm{\mathrm{z}}_{u}$ as input, maps latent codes to images. Hence, we factorize the joint distribution $p(\bm{\mathrm{x}},\bm{\mathrm{z}}_{s},\bm{\mathrm{z}}_{u})$ as:

[TABLE]

By using Jensen’s inequality, the log-likelihood $\log p(\bm{\mathrm{x}})$ can be written as:

[TABLE]

A.2 Log-likelihood regularization term in the label relevant branch

Note that the KL divergence $D_{\text{KL}}(q_{\psi}(\bm{\mathrm{z}}_{s},c|\bm{\mathrm{x}})||p(\bm{\mathrm{z}}_{s},c))$ , the third term of the ELBO in Equation 2, is minimized. If we assume conditional independence between $\bm{\mathrm{z}}_{s}$ and the class $c$ , then we have

[TABLE]

where $p(c|x)$ is the one-hot encoding of the label $y$ . If $q_{\psi}(\bm{\mathrm{z}}_{s}|\bm{\mathrm{x}})$ is formulated as a Gaussian distribution with ${\bm{\Sigma}}\to{\bf 0}$ and mean $\hat{\bm{\mathrm{z}}}_{s}$ output by $Encoder^{s}$ , it is effectively a Dirac delta function:

[TABLE]

The KL regularization term becomes

[TABLE]

The second term relates to the prior distribution, so it can be regarded as a constant. The third term is the negative entropy of the delta function and has nothing to do with $\hat{\bm{\mathrm{z}}}_{s}$ , hence we consider it a constant too. Therefore, we have

[TABLE]

where the prior distribution $p(\hat{\bm{\mathrm{z}}}_{s}|c)$ is set to $N(\hat{\bm{\mathrm{z}}}_{s};{\bm{\mu}_{c}},{\bm{\Sigma}_{c}})$ . Ignoring the constant term, it turns out to be the likelihood regularization term $L_{lkd}$ in Equation 7.

[TABLE]

A.3 Cross-entropy objective in the label relevant branch

To encourage $\bm{\mathrm{z}}_{s}$ to become label relevant as much as possible, the mutual information $I(\bm{\mathrm{z}}_{s};c)$ is maximized, where $\bm{\mathrm{z}}_{s}\sim q_{\psi}(\bm{\mathrm{z}}_{s}|{\bf x})$ . In practice, $I(\bm{\mathrm{z}}_{s};c)$ is hard to optimize directly because it requires access to $p(c|\bm{\mathrm{z}}_{s})$ . We can instead optimize its lower bound by introducing an auxiliary distribution $q(c|\bm{\mathrm{z}}_{s})$ to approximate $p(c|\bm{\mathrm{z}}_{s})$ as in infoGAN [9] .

[TABLE]

Since we still need to sample from $p(c|\bm{\mathrm{z}}_{s})$ in the inner expectation, we adopt Lemma 5.1 in infoGAN to further remove the need of $p(c|\bm{\mathrm{z}}_{s})$ . The first term of the lower bound is a constant, so we ignore it. Then the second term becomes

[TABLE]

We assume that the process of sampling $\bm{\mathrm{z}}_{s}|{\bf x}$ is independent of $c$ , thus

[TABLE]

According to Lemma 5.1 in infoGAN, we have

[TABLE]

Hence

[TABLE]

We further factorize $p(\bm{\mathrm{z}}_{s}|c)$ as $\int p({\bf x}|c)q_{\psi}(\bm{\mathrm{z}}_{s}|{\bf x})d{\bf x}$ , the equation above becomes

[TABLE]

where $p(c|{\bf x})$ is the one-hot encoding of the label $y$ , i.e. $p(c|{\bf x})=\mathbb{I}(c=y)$ . To maximize it is to minimize its opposite, which is exactly the classification loss in Section 3.3.

[TABLE]

Appendix B Experimental details

B.1 Dataset synthesis of toy example

Our synthetic dataset of toy example is a modification of the two-moon dataset, which contains three half circles instead of two. The generative process is described as follows. First, sample data points from three half unit circles with a horizontal interval of 2.2. Then, add Gaussian noises with $std=0.15$ to all of them.
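The generative process can be sketched as below. Several details are assumptions, since the text does not specify them: all three half circles are taken as upper halves, the class index determines the horizontal offset, and the angle is sampled uniformly. The function name `make_three_half_circles` is hypothetical.

```python
import math
import random

def make_three_half_circles(n_per_class, interval=2.2, noise_std=0.15, seed=0):
    """Sample 2D points from three noisy half unit circles, one per class.

    Class c is centered at x = c * interval; Gaussian noise with the given
    standard deviation is added to both coordinates.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for c in range(3):
        for _ in range(n_per_class):
            t = rng.uniform(0.0, math.pi)  # angle on the upper half circle
            x = c * interval + math.cos(t) + rng.gauss(0.0, noise_std)
            y = math.sin(t) + rng.gauss(0.0, noise_std)
            points.append((x, y))
            labels.append(c)
    return points, labels

points, labels = make_three_half_circles(2000, seed=1)
```

With this layout the horizontal coordinate separates the classes while the vertical coordinate is shared by all of them, matching the description of a label relevant $x$ axis and a label irrelevant $y$ axis in Section 4.1.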

B.2 Network architecture of FaceScrub

For the two encoders, $Encoder^{s}$ and $Encoder^{u}$ , we use the VGG [38] architecture with batch normalization added to each layer and replace the last three fc layers with two fc layers of 1024 and 512 units. For the decoder, an inverse structure of the encoders is applied. The adversarial classifier in Section 3.2 consists of two fc layers of 256 and 530 units, and the discriminator contains 7 convolution layers and two fc layers (details are shown in Table 4). Note that spectral normalization [29] is applied to all of the weights in the discriminator and the label embedding is incorporated in the first fc layer as in [30].

B.3 Network architecture of Cifar-10

The network structures of the two encoders, decoder and discriminator for Cifar-10 are shown in Table 3. The adversarial classifier in the latent space is similar to that used for FaceScrub, consisting of two fc layers of 256 and 10 units. Also, spectral normalization and label embedding are applied in the discriminator.

B.4 Optimization

We use the Adam optimizer with $\alpha=0.0005$ , $\beta_{1}=0$ and $\beta_{2}=0.9$ . Because the first stage, which uses $L_{GM}$ , is run 3 times per second-stage iteration, $L_{GM}$ converges quickly. Continuing to train after it converges makes $L_{GM}$ unstable because ${\bm{\Sigma}_{c}}$ gradually shrinks. In practice, we decay the learning rate of ${\bm{\Sigma}_{c}}$ by 0.01 after 2 epochs.

B.5 Inception Score

Recall that $Inception\ Score$ requires access to the conditional class probability $p(y|{\bf x})$ . We use classification model of Inception-ResNet-v1 [40] architecture trained on VGGFace2 [8] to evaluate generative models trained on FaceScrub. For generative models trained on Cifar-10, classification model of Inception-v3 [41] architecture trained on ImageNet [35] is used.

Appendix C Additional experiment results

C.1 More generated samples on FaceScrub and Cifar-10

Figure 8 shows generated samples of our method on FaceScrub and Cifar-10 with each row corresponding to a certain class.

C.2 Additional experiments on CUB-200-2011 and Cifar-100

We additionally apply our method to the CUB-200-2011 [43] and Cifar-100 [21] datasets. CUB-200-2011 contains 200 categories of birds with 11,788 images in total. For CUB-200-2011, we crop the images according to the bounding boxes provided by the dataset and resize the cropped images to 64 $\times$ 64. The network structure is the same as that used for FaceScrub. For Cifar-100, we use the same network as for Cifar-10. Generated images are shown in Figure 9. Results of $Inception\ Score$ and intra-class diversity are listed in Table 5 and Table 6, respectively.

Bibliography

[1] M. E. Abbasnejad, A. Dick, and A. van den Hengel. Infinite variational autoencoder for semi-supervised learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 781–790. IEEE, 2017.
[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. stat, 1050:9, 2017.
[3] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua. CVAE-GAN: Fine-grained image generation through asymmetric training. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2764–2773. IEEE, 2017.
[4] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua. Towards open-set identity preserving face synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6713–6722, 2018.
[5] M. Ben-Yosef and D. Weinshall. Gaussian mixture generative adversarial networks for diverse datasets, and the unsupervised clustering of images. arXiv preprint arXiv:1808.10356, 2018.
[6] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 7, 2017.
[7] C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner. Understanding disentangling in $\beta$-VAE. arXiv preprint arXiv:1804.03599, 2018.
[8] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE International Conference on, pages 67–74. IEEE, 2018.