Using Photorealistic Face Synthesis and Domain Adaptation to Improve Facial Expression Analysis

Behzad Bozorgtabar, Mohammad Saeed Rad, Hazım Kemal Ekenel, Jean-Philippe Thiran

arXiv:1905.08090·cs.CV·May 21, 2019

TL;DR

This paper introduces a novel attribute-guided face synthesis method that improves facial expression recognition by bridging the gap between synthetic and real face images through domain adaptation, enhancing accuracy in various datasets.

Contribution

It proposes a new face synthesis model for cross-domain translation and domain adaptation, improving expression recognition performance on real-world and in-the-wild datasets.

Findings

1. Enhanced expression recognition accuracy on multiple datasets.
2. Effective face synthesis that reduces domain discrepancy.
3. Improved performance on in-the-wild driver emotion recognition.

Abstract

Synthesizing realistic faces across domains to train deep models has attracted increasing attention in facial expression analysis, as it helps to improve expression recognition accuracy when only a small number of real training images is available. However, learning from synthetic face images can be problematic because of the distribution discrepancy between low-quality synthetic images and real face images, and the learned model may not achieve the desired performance when applied to real-world scenarios. To this end, we propose a new attribute-guided face image synthesis method that performs translation between multiple image domains using a single model. In addition, we adopt the proposed model to learn from synthetic faces by matching the feature distributions between different domains while preserving each domain's characteristics. We evaluate the effectiveness of the proposed approach…

Tables

Table I. The generator architecture. $n_{y}$ denotes the dimension of the domain attributes. IN and RB denote instance normalization and residual block, respectively.

Part	Layers	Input Size $\to$ Output Size	Filter Size	Stride	Padding
Encoder	Conv+IN+ReLU	$(h, w, 6) \to (h, w, 64)$	$7 \times 7$	1	3
Encoder	Conv+IN+ReLU	$(h, w, 64) \to (\frac{h}{2}, \frac{w}{2}, 128)$	$4 \times 4$	2	1
Encoder	Conv+IN+ReLU	$(\frac{h}{2}, \frac{w}{2}, 128) \to (\frac{h}{4}, \frac{w}{4}, 256)$	$4 \times 4$	2	1
Encoder	Conv+IN+ReLU	$(\frac{h}{4}, \frac{w}{4}, 256) \to (\frac{h}{8}, \frac{w}{8}, 512)$	$4 \times 4$	2	1
Encoder	Conv+IN+ReLU	$(\frac{h}{8}, \frac{w}{8}, 512) \to (\frac{h}{16}, \frac{w}{16}, 1024)$	$4 \times 4$	2	1
Encoder Bottleneck	RB:Conv+IN+ReLU	$(\frac{h}{16}, \frac{w}{16}, 1024) \to (\frac{h}{16}, \frac{w}{16}, 1024)$	$3 \times 3$	1	1
Encoder Bottleneck	RB:Conv+IN+ReLU	$(\frac{h}{16}, \frac{w}{16}, 1024) \to (\frac{h}{16}, \frac{w}{16}, 1024)$	$3 \times 3$	1	1
Encoder Bottleneck	RB:Conv+IN+ReLU	$(\frac{h}{16}, \frac{w}{16}, 1024) \to (\frac{h}{16}, \frac{w}{16}, 1024)$	$3 \times 3$	1	1
Encoder Bottleneck	RB:Conv+IN+ReLU	$(\frac{h}{16}, \frac{w}{16}, 1024) \to (\frac{h}{16}, \frac{w}{16}, 1024)$	$3 \times 3$	1	1
Encoder Bottleneck	RB:Conv+IN+ReLU	$(\frac{h}{16}, \frac{w}{16}, 1024) \to (\frac{h}{16}, \frac{w}{16}, 1024)$	$3 \times 3$	1	1
Encoder Bottleneck	RB:Conv+IN+ReLU	$(\frac{h}{16}, \frac{w}{16}, 1024) \to (\frac{h}{16}, \frac{w}{16}, 1024)$	$3 \times 3$	1	1
Decoder	Sub-Pixel Conv+IN+ReLU	$(\frac{h}{16}, \frac{w}{16}, 1024 + n_{y}) \to (\frac{h}{8}, \frac{w}{8}, 512)$	$3 \times 3$	2	1
Decoder	Sub-Pixel Conv+IN+ReLU	$(\frac{h}{8}, \frac{w}{8}, 512) \to (\frac{h}{4}, \frac{w}{4}, 256)$	$3 \times 3$	2	1
Decoder	Sub-Pixel Conv+IN+ReLU	$(\frac{h}{4}, \frac{w}{4}, 256) \to (\frac{h}{2}, \frac{w}{2}, 128)$	$3 \times 3$	2	1
Decoder	Sub-Pixel Conv+IN+ReLU	$(\frac{h}{2}, \frac{w}{2}, 128) \to (h, w, 64)$	$3 \times 3$	2	1
Decoder	Image output:Conv+Tanh	$(h, w, 64) \to (h, w, 3)$	$7 \times 7$	1	3
Decoder	Side output:Conv+Tanh	$(h, w, 64) \to (h, w, 3)$	$7 \times 7$	1	3
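The sizes listed for the generator can be checked with a short shape trace. The sketch below is only an illustration: it assumes a square input, the standard convolution output-size formula, that each sub-pixel stage doubles the spatial resolution, and a hypothetical attribute dimension `n_y=7`. The function names are not from the paper.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial size after a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_generator(h, n_y):
    """Return the decoder input channels and a (stage, size, channels) trace."""
    trace = []
    size, ch = h, 6
    # Encoder: one 7x7 stride-1 conv, then four 4x4 stride-2 convs.
    encoder = [(7, 1, 3, 64), (4, 2, 1, 128), (4, 2, 1, 256),
               (4, 2, 1, 512), (4, 2, 1, 1024)]
    for kernel, stride, padding, out_ch in encoder:
        size, ch = conv_out(size, kernel, stride, padding), out_ch
        trace.append(("enc", size, ch))
    # Bottleneck: six residual blocks keep both size and channels.
    for _ in range(6):
        size = conv_out(size, 3, 1, 1)
        trace.append(("rb", size, ch))
    # Decoder: the attribute vector adds n_y channels to the first input;
    # each sub-pixel stage is assumed to double the spatial size.
    decoder_in = ch + n_y
    for out_ch in (512, 256, 128, 64):
        size, ch = size * 2, out_ch
        trace.append(("dec", size, ch))
    return decoder_in, trace


decoder_in, trace = trace_generator(128, n_y=7)
for stage, size, ch in trace:
    print(stage, size, ch)
```

For a 128 x 128 input, the encoder ends at 8 x 8 with 1024 channels (that is, $h/16$), and the decoder returns to 128 x 128 with 64 channels before the two output convolutions.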

Table II. The discriminator architecture. FC and $m$ denote the fully connected layer and the number of target attributes, respectively.

Part	Layers	Input Size $\to$ Output Size	Filter Size	Stride	Padding
Discriminator Hidden Layers	Conv+Leaky ReLU	$(h, w, 6) \to (\frac{h}{2}, \frac{w}{2}, 64)$	$4 \times 4$	2	1
Discriminator Hidden Layers	Conv+Leaky ReLU	$(\frac{h}{2}, \frac{w}{2}, 64) \to (\frac{h}{4}, \frac{w}{4}, 128)$	$4 \times 4$	2	1
Discriminator Hidden Layers	Conv+Leaky ReLU	$(\frac{h}{4}, \frac{w}{4}, 128) \to (\frac{h}{8}, \frac{w}{8}, 256)$	$4 \times 4$	2	1
Discriminator Hidden Layers	Conv+Leaky ReLU	$(\frac{h}{8}, \frac{w}{8}, 256) \to (\frac{h}{16}, \frac{w}{16}, 512)$	$4 \times 4$	2	1
Discriminator Hidden Layers	Conv+Leaky ReLU	$(\frac{h}{16}, \frac{w}{16}, 512) \to (\frac{h}{32}, \frac{w}{32}, 1024)$	$4 \times 4$	2	1
Discriminator Hidden Layers	Conv+Leaky ReLU	$(\frac{h}{32}, \frac{w}{32}, 1024) \to (\frac{h}{64}, \frac{w}{64}, 2048)$	$4 \times 4$	2	1
Outputs	Output Layer:Conv	$(\frac{h}{64}, \frac{w}{64}, 2048) \to (\frac{h}{64}, \frac{w}{64}, 1)$	$3 \times 3$	1	1
Outputs	Output Layer:FC	$(\frac{h}{64}, \frac{w}{64}, 2048) \to m$	$-$	$-$	$-$

Table III. Performance comparison on the MUG dataset.

Method	Accuracy
Real Test Images	90.42%
CycleGAN [41]	84.40%
IcGAN [28]	80.32%
Proposed Method	89.91%

Table V. Recognition accuracies on face images at different pose yaw angles.

Method	$\pm 15$	$\pm 30$	$\pm 45$
Real Profile Images	70.15%	66.50%	58.90%
Synthetic Frontal Face Images	70.91%	65.90%	59.30%
Proposed Method	72.10%	68.52%	63.35%

Equations

$$\mathcal{L}_{adv}=\mathbb{E}_{x,s}\left[D_{src}\left(x,s\right)\right]-\mathbb{E}_{x,s,y}\left[D_{src}\left(G_{dec}\left(G_{enc}\left(x,s\right),y\right)\right)\right]-\lambda_{gp}\,\mathcal{L}_{gp}\left(D_{src}\right) \tag{1}$$

$$\min_{\theta_{dis}}\mathcal{L}_{cls_{r}}=\mathbb{E}_{x,s,{y}^{\prime}}\left[\ell_{r}\left(x,s,{y}^{\prime}\right)\right],\qquad \ell_{r}\left(x,s,{y}^{\prime}\right)=\sum_{i=1}^{m}-{y}^{\prime}_{i}\log D_{cls}\left(x,s\right)-\left(1-{y}^{\prime}_{i}\right)\log\left(1-D_{cls}\left(x,s\right)\right) \tag{2}$$

$$\min_{\theta_{enc},\theta_{dec}}\mathcal{L}_{cls_{f}}=\mathbb{E}_{x,s,{y}^{\prime}}\left[\ell_{f}\left({x}^{\prime},{s}^{\prime},y\right)\right],\qquad \ell_{f}\left({x}^{\prime},{s}^{\prime},y\right)=\sum_{i=1}^{m}-y_{i}\log D_{cls}\left({x}^{\prime},{s}^{\prime}\right)-\left(1-y_{i}\right)\log\left(1-D_{cls}\left({x}^{\prime},{s}^{\prime}\right)\right) \tag{3}$$

$$\mathcal{L}_{id}=\mathbb{E}_{x,s,{y}^{\prime}}\left[\left\lVert G_{dec}\left(G_{enc}\left(x,s\right),{y}^{\prime}\right)-x\right\rVert_{1}\right] \tag{4}$$

$$\begin{aligned}\mathcal{L}_{bi}&=\mathbb{E}_{x,s,{y}^{\prime}}\left[\left\lVert x-\hat{x}\right\rVert_{1}+\left\lVert s-\hat{s}\right\rVert_{1}\right]+\mathbb{E}_{x,s,y}\left[\left\lVert G_{enc}\left(x,s\right)-G_{enc}\left({x}^{\prime},{s}^{\prime}\right)\right\rVert_{1}\right],\\ {x}^{\prime},{s}^{\prime}&=G_{dec}\left(G_{enc}\left(x,s\right),y\right),\\ \hat{x},\hat{s}&=G_{dec}\left(G_{enc}\left({x}^{\prime},{s}^{\prime}\right),{y}^{\prime}\right)\end{aligned} \tag{5}$$

$$\begin{aligned}\mathcal{L}_{G}&=\mathcal{L}_{adv}+\lambda_{bi}\mathcal{L}_{bi}+\lambda_{cls}\mathcal{L}_{cls_{f}}+\lambda_{id}\mathcal{L}_{id},\\ \mathcal{L}_{D}&=-\mathcal{L}_{adv}+\lambda_{cls}\mathcal{L}_{cls_{r}}\end{aligned} \tag{6}$$

Full text

Using Photorealistic Face Synthesis and Domain Adaptation to Improve Facial Expression Analysis

Behzad Bozorgtabar¹, Mohammad Saeed Rad¹, Hazım Kemal Ekenel², Jean-Philippe Thiran¹

¹École Polytechnique Fédérale de Lausanne, Switzerland; ²Istanbul Technical University, Istanbul, Turkey

Abstract

Synthesizing realistic faces across domains to train deep models has attracted increasing attention in facial expression analysis, as it helps to improve expression recognition accuracy when only a small number of real training images is available. However, learning from synthetic face images can be problematic because of the distribution discrepancy between low-quality synthetic images and real face images, and the learned model may not achieve the desired performance when applied to real-world scenarios. To this end, we propose a new attribute-guided face image synthesis method that performs translation between multiple image domains using a single model. In addition, we adopt the proposed model to learn from synthetic faces by matching the feature distributions between different domains while preserving each domain’s characteristics. We evaluate the effectiveness of the proposed approach on several face datasets for generating realistic face images. We demonstrate that expression recognition performance can be enhanced by benefiting from our face synthesis model. Moreover, we conduct experiments on a near-infrared dataset containing facial expression videos of drivers to assess performance on in-the-wild data for driver emotion recognition.

I Introduction

Face image synthesis has received increasing attention because of its practical applications in human-computer interaction, facial animation, face recognition, and facial expression recognition. The most notable advances in image synthesis have come from breakthroughs in generative models. In particular, Generative Adversarial Network (GAN) [7] variants have achieved state-of-the-art results on the image-to-image translation task. These GAN models can discover relations between two visual domains using paired [12] or unpaired data [15, 41] during training. However, most existing GAN models [29, 41] are designed to synthesize images of a single attribute, which makes their training inefficient when multiple attributes are involved, since a separate model is needed for each attribute.

In this paper, we pursue several objectives: synthesizing realistic faces by controlling the facial attributes of interest (e.g. facial expression), preserving the facial identity after manipulation, and investigating learning from synthetic facial images to improve expression recognition accuracy (see Fig. 1). Our goal is to use a single model to synthesize face photos with a desired attribute and to translate an input image from one domain into multiple domains without matching image pairs. Our proposed method is based on an encoder-decoder structure that uses the image latent representation, where we model the shared latent representation across image domains. Therefore, at inference time, by changing the face attributes, we can generate plausible face images exhibiting the attribute of interest. We also introduce a bidirectional loss for the latent representation, which can resolve generator mode collapse and ensure that diverse outputs are produced depending on the input attribute.

Our paper makes the following contributions:

1. We extend the previous work [4] and show how the proposed approach can be used to generate photo-realistic facial images using a synthetic face image and unlabeled real face images as input. We compare our results with the SimGAN method [31] in terms of expression recognition accuracy to assess the improvement in the realism of the generated faces, using video data recorded in real-world conditions;

2. Compared to other variants of GAN models [41, 28], we demonstrate that the learnt representation achieves high-quality image synthesis results and preserves a given expression, which contributes to improving expression recognition accuracy when the synthesized images are used for data augmentation;

3. Lastly, unlike existing methods in facial expression synthesis, which have only been validated on face datasets captured in lab-controlled environments, we test our approach on an in-the-wild video dataset, which contains arbitrary face poses, illumination and self-occlusions.

II Related work

GAN-based models [7] have achieved impressive results in many image synthesis applications, including image super-resolution [21], image-to-image translation (pix2pix) [12] and face image editing [25, 35]. We summarize the contributions of a few important related works using GANs in the following subsections:

II-A Image-to-Image Translation

Many existing image-to-image translation methods, e.g. [12, 31], formulate GANs in the supervised setting, where example image pairs are available. However, collecting paired training data can be difficult. Other GAN-based methods do not require matching pairs of samples. For example, CycleGAN [41] can learn transformations from a source to a target domain without a one-to-one mapping between the two domains’ training data. However, these GAN-based methods can only train one specific model for each pair of image domains. Unlike the aforementioned approaches, we use a single model to synthesize multiple photo-realistic images, each having a specific attribute. Recently, conditional information, such as image labels, has been used to manipulate image attributes during synthesis. For example, IcGAN [28] and StarGAN [5] proposed image editing using AC-GAN [26] with conditional information. However, these methods offer no counterpart to our domain adaptation step, which adds realism to simulated faces. Similar to [28], Fader Networks [19] proposed an image synthesis model that does not need to apply a GAN to the decoder output. However, these methods impose constraints on the image latent space to enforce its independence from the attributes of interest, which may result in a loss of information when generating attribute-guided images.

II-B Face Attribute Manipulation and Face Generation

Li et al. [22] proposed a deep convolutional network model for Identity-Aware Transfer (DIAT) of facial attributes. The works [29] and [14] edit only a single facial attribute. Lu et al. [24] proposed attribute-guided face generation to translate low-resolution face images to high-resolution face images. Zhang et al. [38] introduced a spatial attention mechanism into the GAN framework to alter only the attribute-specific face region and keep the rest unchanged. Huang et al. [11] proposed a Two-Pathway Generative Adversarial Network (TP-GAN) for photorealistic face synthesis by simultaneously considering local face details and global structures.

II-C Expression Transfer and Face Frontalization

Zhang et al. [37] proposed a method that disentangles the attributes (expression and pose) for simultaneous pose-invariant facial expression recognition and face image synthesis. Instead, we seek to learn attribute-invariant information in the latent space by imposing an auxiliary classifier on the generated images. Zhou et al. [40] introduced a conditional difference adversarial autoencoder (CDAAE) that uses emotion states as a conditional attribute for face generation. Lai et al. [18] proposed a multi-task GAN-based network that learns to synthesize frontal face images from profile face images. However, their method requires paired training data of frontal and profile faces. Instead, we seek to add realism to synthetic frontal face images without requiring real frontal face images during training. Our method can generate realistic frontal face images using synthetic faces and real faces with arbitrary poses as input.

III Methods

We first introduce our proposed attribute guided face synthesis model in Section III-A. Then, we explore learning from simulated face images through adversarial training in Section III-B. Finally, we discuss our implementation details and experimental results in Sections IV and V, respectively.

III-A Attribute-guided face image synthesis

The inputs to our framework are a face image, the attributes to be edited (e.g. facial expression), and a side image that provides additional information to guide photo-realistic face synthesis. Let $\mathcal{X}$ and $\mathcal{S}$ denote the original image and side conditional image domains, respectively, and $\mathcal{Y}$ the set of possible facial attributes. The training set consists of $m$ input triples $\left(x_{i}\in\mathcal{X},s_{i}\in\mathcal{S},y_{i}\in\mathcal{Y}\right)$, where $x_{i}$ and $y_{i}$ are the $i^{th}$ input face image and binary attribute, respectively, and $s_{i}$ represents the $i^{th}$ conditional side image. Then, for any categorical attribute vector $y$ from the set of possible facial attributes $\mathcal{Y}$, the objective is to train a model that generates photo-realistic versions (${x}^{\prime}$ or ${s}^{\prime}$) of the inputs ($x$ and $s$) from image domains $\mathcal{X}$ and $\mathcal{S}$ with the desired attributes $y$.

Our model is based on the encoder-decoder architecture with domain adversarial training. As the input to our expression synthesis model (see Fig. 2), we propose to incorporate an individual-specific facial shape model (face landmark heatmap) as the side conditional information $s$ in addition to the original input image $x$. The face landmark heatmap contains 2D Gaussians centered at the landmark locations and is simply concatenated channel-wise with the input image. Our goal is then to train a single generator $G$, composed of the encoder $G_{enc}$ and decoder $G_{dec}$ networks, to translate the input pair $\left(x,s\right)$ from the source domains into the corresponding output images $\left({x}^{\prime},{s}^{\prime}\right)$ in the target domain, conditioned on the target domain attribute $y$ and the inputs' latent representation $G_{enc}\left(x,s\right)$: $G_{dec}\left(G_{enc}\left(x,s\right),y\right)\rightarrow{x}^{\prime},{s}^{\prime}$. The encoder $G_{enc}:\left(\mathcal{X}^{source},\mathcal{S}^{source}\right)\rightarrow\mathbb{R}^{n\times\frac{h}{16}\times\frac{w}{16}}$ is a fully convolutional neural network with parameters $\theta_{enc}$ that encodes the input images into a low-dimensional feature space $G_{enc}\left(x,s\right)$, where $n$ is the number of feature channels and $h,w$ are the input image dimensions. The decoder $G_{dec}:\left(\mathbb{R}^{n\times\frac{h}{16}\times\frac{w}{16}},\mathcal{Y}\right)\rightarrow\mathcal{X}^{target},\mathcal{S}^{target}$ is a sub-pixel [30] convolutional neural network with parameters $\theta_{dec}$ that produces realistic images with the target domain attribute $y$, given the latent representation $G_{enc}\left(x,s\right)$. The precise architectures of the neural networks are described in Section IV-A. During training, we randomly sample the target domain attributes $y$ to make the generator more flexible in synthesizing images.
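The landmark heatmap described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' code: the Gaussian width `sigma`, the use of a per-pixel maximum where Gaussians overlap, the single-channel heatmap, and the helper names are all hypothetical choices.

```python
import math


def landmark_heatmap(landmarks, height, width, sigma=2.0):
    """Single-channel map with a 2D Gaussian at each (row, col) landmark."""
    heatmap = [[0.0] * width for _ in range(height)]
    for r0, c0 in landmarks:
        for r in range(height):
            for c in range(width):
                d2 = (r - r0) ** 2 + (c - c0) ** 2
                value = math.exp(-d2 / (2.0 * sigma ** 2))
                # Keep the strongest response where Gaussians overlap.
                heatmap[r][c] = max(heatmap[r][c], value)
    return heatmap


def concat_channels(image, heatmap):
    """Channel-wise concatenation; image is a list of H x W channels."""
    return image + [heatmap]


# Toy usage: a 3-channel 8x8 "image" and two hypothetical landmarks.
image = [[[0.5] * 8 for _ in range(8)] for _ in range(3)]
heat = landmark_heatmap([(2, 3), (5, 6)], height=8, width=8)
stacked = concat_channels(image, heat)
```

Note that the generator table lists a 6-channel input, so the side image in the paper presumably occupies three channels; the sketch uses one channel only to keep the example small.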
In the following, we introduce the objectives for the proposed model optimization.

III-A1 Adversarial Loss

We propose to learn attribute-invariant information in the latent space representing the shared features of the images sampled for different attributes. That is, if the original and target domains are semantically similar (e.g. facial images of different expressions), we expect the common features across domains to be captured by the same latent representation. The decoder must then use the target attribute to perform image-to-image translation from the original domain to the target domain. However, this learning process is unsupervised, since for each training image from the source domain, its counterpart image in the target domain with attribute $y$ is unknown. Therefore, we propose to train an additional neural network called the discriminator $D$ (with parameters $\theta_{dis}$) using an adversarial formulation, not only to distinguish between real and generated (fake) images, but also to classify each image into its corresponding attribute categories. We use the Wasserstein GAN [8] objective with a gradient penalty loss $\mathcal{L}_{gp}$ [2], formulated as below:

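$$\mathcal{L}_{adv}=\mathbb{E}_{x,s}\left[D_{src}\left(x,s\right)\right]-\mathbb{E}_{x,s,y}\left[D_{src}\left(G_{dec}\left(G_{enc}\left(x,s\right),y\right)\right)\right]-\lambda_{gp}\,\mathcal{L}_{gp}\left(D_{src}\right) \tag{1}$$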

The term $D_{src}\left(\cdot\right)$ denotes a probability distribution over image sources given by $D$. The hyper-parameter $\lambda_{gp}$ balances the GAN objective against the gradient penalty. The generator (encoder-decoder networks) in our model plays two roles: it learns the attribute-invariant representation of the input images, and it is trained to maximally fool the discriminator in a min-max game. The discriminator, in turn, simultaneously seeks to identify the fake examples for each attribute.
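As a numeric illustration of the adversarial objective, the sketch below evaluates $\mathcal{L}_{adv}$ from precomputed critic scores. The text does not spell out $\mathcal{L}_{gp}$, so the penalty here assumes the standard WGAN-GP form, the mean of $(\lVert\nabla D\rVert_{2}-1)^{2}$ at interpolated samples; in a real implementation the gradient norms would come from automatic differentiation. The function names are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def adversarial_loss(real_scores, fake_scores, grad_norms, lambda_gp=10.0):
    """L_adv = E[D(real)] - E[D(fake)] - lambda_gp * L_gp."""
    # Gradient penalty (assumed standard WGAN-GP form): gradient norms of
    # the critic at interpolated samples are pulled toward 1.
    gp = mean([(g - 1.0) ** 2 for g in grad_norms])
    return mean(real_scores) - mean(fake_scores) - lambda_gp * gp


loss = adversarial_loss(
    real_scores=[1.0, 3.0],   # D_src on real pairs (x, s)
    fake_scores=[-1.0, 1.0],  # D_src on generated pairs
    grad_norms=[1.0, 1.5],    # hypothetical gradient norms
)
```

With these toy numbers the mean real score is 2, the mean fake score is 0, and the penalty is 0.125, so the loss evaluates to 0.75.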

III-A2 Attribute Classification Loss

We use a classifier, implemented as an additional output of the discriminator, to perform the auxiliary task of classifying the synthesized and real facial images into their respective attribute values. The attribute classification loss of real images $\mathcal{L}_{cls_{r}}$, used to optimize the discriminator parameters $\theta_{dis}$, is defined as follows:

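$$\min_{\theta_{dis}}\mathcal{L}_{cls_{r}}=\mathbb{E}_{x,s,{y}^{\prime}}\left[\ell_{r}\left(x,s,{y}^{\prime}\right)\right],\qquad \ell_{r}\left(x,s,{y}^{\prime}\right)=\sum_{i=1}^{m}-{y}^{\prime}_{i}\log D_{cls}\left(x,s\right)-\left(1-{y}^{\prime}_{i}\right)\log\left(1-D_{cls}\left(x,s\right)\right) \tag{2}$$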

Here, ${y}^{\prime}$ denotes the original attribute categories of the real images, and $\ell_{r}$ is the summation of the binary cross-entropy losses over all attributes. In addition, the attribute classification loss of fake images $\mathcal{L}_{cls_{f}}$, used to optimize the generator parameters $\left(\theta_{enc},\theta_{dec}\right)$, is formulated as follows:

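$$\min_{\theta_{enc},\theta_{dec}}\mathcal{L}_{cls_{f}}=\mathbb{E}_{x,s,{y}^{\prime}}\left[\ell_{f}\left({x}^{\prime},{s}^{\prime},y\right)\right],\qquad \ell_{f}\left({x}^{\prime},{s}^{\prime},y\right)=\sum_{i=1}^{m}-y_{i}\log D_{cls}\left({x}^{\prime},{s}^{\prime}\right)-\left(1-y_{i}\right)\log\left(1-D_{cls}\left({x}^{\prime},{s}^{\prime}\right)\right) \tag{3}$$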

where ${x}^{\prime}$ and ${s}^{\prime}$ are the generated images and auxiliary outputs, which should correctly exhibit the target domain attributes $y$. $\ell_{f}$ denotes the sum of the cross-entropy losses for the fake images.
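The per-sample terms $\ell_{r}$ and $\ell_{f}$ share the same form: a sum of binary cross-entropy terms over the $m$ attributes. A minimal sketch follows, assuming the classifier head returns one probability per attribute; the clamping constant `eps` and the function name are illustrative additions.

```python
import math


def attribute_bce(probs, targets, eps=1e-12):
    """Sum of binary cross-entropy terms over the m attributes."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -y * math.log(p) - (1 - y) * math.log(1.0 - p)
    return total


# Real image with its original attributes y' (discriminator term l_r).
loss_real = attribute_bce(probs=[0.9, 0.2, 0.1], targets=[1, 0, 0])
# Generated image with the target attributes y (generator term l_f).
loss_fake = attribute_bce(probs=[0.3, 0.6, 0.1], targets=[0, 1, 0])
```

The only difference between the two uses is which network receives the gradient: the discriminator for real images, and the encoder-decoder for generated ones.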

III-A3 Identity Loss

With the identity loss, we aim to preserve facial image details that are unrelated to the attribute, such as facial identity, before and after image translation. We use a pixel-wise $l_{1}$ loss to enforce consistency of the facial details and suppress blurriness:

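$$\mathcal{L}_{id}=\mathbb{E}_{x,s,{y}^{\prime}}\left[\left\lVert G_{dec}\left(G_{enc}\left(x,s\right),{y}^{\prime}\right)-x\right\rVert_{1}\right] \tag{4}$$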

III-A4 Bidirectional Loss

Using the adversarial loss alone usually leads to mode collapse: the generator learns to ignore the attributes, so changing $y$ at test time does not produce diverse facial images. This issue has been observed in various applications of conditional GANs [12, 6] and, to our knowledge, there is still no proper approach to deal with it. To address this problem, we show that, using the trained generator, images of different domains can be translated bidirectionally. We decompose this objective into two terms: a bidirectional loss for the image latent representation, and a bidirectional loss between the synthesized images and the original input images. This objective is formulated using the $l_{1}$ loss as follows:

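$$\begin{aligned}\mathcal{L}_{bi}&=\mathbb{E}_{x,s,{y}^{\prime}}\left[\left\lVert x-\hat{x}\right\rVert_{1}+\left\lVert s-\hat{s}\right\rVert_{1}\right]+\mathbb{E}_{x,s,y}\left[\left\lVert G_{enc}\left(x,s\right)-G_{enc}\left({x}^{\prime},{s}^{\prime}\right)\right\rVert_{1}\right],\\ {x}^{\prime},{s}^{\prime}&=G_{dec}\left(G_{enc}\left(x,s\right),y\right),\\ \hat{x},\hat{s}&=G_{dec}\left(G_{enc}\left({x}^{\prime},{s}^{\prime}\right),{y}^{\prime}\right)\end{aligned} \tag{5}$$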

In the above equation, $\hat{x}$ and $\hat{s}$ denote the reconstructed original image and the side conditional image, respectively. Unlike [41], where only the cycle consistency losses are used at the image level, we additionally seek to minimize the reconstruction loss using latent representation.
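The two-way translation behind the bidirectional loss can be traced on a single sample. The sketch below is a toy: `toy_encode` and `toy_decode` are hypothetical stand-ins for $G_{enc}$ and $G_{dec}$, images are flat lists of floats, and the expectation over the training set is omitted.

```python
def l1(a, b):
    """l1 distance between two equally sized flat lists."""
    return sum(abs(u - v) for u, v in zip(a, b))


def bidirectional_loss(x, s, y_target, y_orig, encode, decode):
    """Image-level cycle term plus latent-level term for one sample."""
    z = encode(x, s)                      # G_enc(x, s)
    x_gen, s_gen = decode(z, y_target)    # x', s'
    z_gen = encode(x_gen, s_gen)          # G_enc(x', s')
    x_rec, s_rec = decode(z_gen, y_orig)  # x_hat, s_hat
    image_term = l1(x, x_rec) + l1(s, s_rec)
    latent_term = l1(z, z_gen)
    return image_term + latent_term


def toy_encode(x, s):
    """Hypothetical encoder: element-wise average of the two inputs."""
    return [(a + b) / 2.0 for a, b in zip(x, s)]


def toy_decode(z, y):
    """Hypothetical decoder: shift the latent code by the attribute."""
    shifted = [v + y for v in z]
    return shifted, list(shifted)


loss = bidirectional_loss([1.0, 3.0], [3.0, 1.0], y_target=1.0, y_orig=0.0,
                          encode=toy_encode, decode=toy_decode)
```

Here the image-level term contributes 4 and the latent-level term contributes 2, giving 6 in total. The latent term is what distinguishes this objective from an image-only cycle-consistency loss.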

III-A5 Overall Objective

Finally, the generator $G$ is trained with a linear combination of four loss terms: adversarial loss, attribute classification loss for the fake images, bidirectional loss, and identity loss. Meanwhile, the discriminator $D$ is optimized using an adversarial loss and attribute classification loss for the real images:

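$$\begin{aligned}\mathcal{L}_{G}&=\mathcal{L}_{adv}+\lambda_{bi}\mathcal{L}_{bi}+\lambda_{cls}\mathcal{L}_{cls_{f}}+\lambda_{id}\mathcal{L}_{id},\\ \mathcal{L}_{D}&=-\mathcal{L}_{adv}+\lambda_{cls}\mathcal{L}_{cls_{r}}\end{aligned} \tag{6}$$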

where $\lambda_{bi}$, $\lambda_{id}$ and $\lambda_{cls}$ are hyper-parameters that tune the importance of the bidirectional loss, identity loss and attribute classification loss, respectively.
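The overall objective is a plain weighted sum, which the following sketch makes explicit. The default weights are the values reported in the implementation details ($\lambda_{bi}=10$, $\lambda_{id}=10$, $\lambda_{cls}=1$); the function names and the toy loss values are hypothetical.

```python
def generator_loss(l_adv, l_bi, l_cls_f, l_id,
                   lambda_bi=10.0, lambda_cls=1.0, lambda_id=10.0):
    """L_G = L_adv + lambda_bi*L_bi + lambda_cls*L_cls_f + lambda_id*L_id."""
    return l_adv + lambda_bi * l_bi + lambda_cls * l_cls_f + lambda_id * l_id


def discriminator_loss(l_adv, l_cls_r, lambda_cls=1.0):
    """L_D = -L_adv + lambda_cls*L_cls_r."""
    return -l_adv + lambda_cls * l_cls_r


# Toy loss values, chosen only to show how the terms combine.
l_g = generator_loss(l_adv=0.5, l_bi=0.1, l_cls_f=0.2, l_id=0.05)
l_d = discriminator_loss(l_adv=0.5, l_cls_r=0.3)
```

The adversarial term enters the two objectives with opposite signs, which is what makes the training a min-max game.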

III-B Realism Refinement Using Domain Adaptation

In unconstrained facial expression recognition (FER), accuracy drops significantly under large pose variations. One possible solution is to use simulated faces rendered in frontal view. In particular, we utilize a 3D Morphable Model based on a bilinear face model [34] to construct a simulated frontal face image. Fig. 3 shows examples of simulated faces. However, learning from synthetic face images can be problematic due to the distribution discrepancy between real and synthetic images. Using the attribute-guided face synthesis proposed in Section III-A, the model takes a simulated frontal face image $x$ and a real face image with arbitrary pose $s$ as inputs, and generates a photo-realistic version ${x}^{\prime}$ of the synthetic face during the realism refinement. With a side input image as a condition, the model has enough information about the appearance of the desired face in advance, and we transfer the texture from a given unlabeled real face image with arbitrary pose to a synthetic frontal face (see Fig. 4). Here, the discriminator’s role is to judge the realism of a single generated image using real profile face images. In addition, using the same discriminator, we can generate images exhibiting arbitrary attributes, e.g. different expressions.

We compare the pose-normalized face attribute transfer results of our proposed method with the SimGAN method [31] on the BU-3DFE dataset [36]. SimGAN [31] learns from simulated and unsupervised images through adversarial training. Our method differs from SimGAN in the following aspects: 1) we aim to synthesize photo-realistic frontal faces by preserving the face pose to address the challenges in unconstrained facial expression recognition, whereas SimGAN is designed for simpler scenarios, e.g. eye image refinement; 2) SimGAN ignores categorical information, which limits its performance, whereas our proposed method overcomes this issue by introducing the attribute classification loss into the objective function. For a fair comparison with SimGAN, we add the attribute classification loss by modifying SimGAN’s discriminator while keeping the rest of the network unchanged. We achieve more visually pleasing results on test data than SimGAN (see Fig. 6). Our proposed method can preserve the face image content while modifying only the attribute-related part of the images using the latent representation.

IV Implementation Details

All networks are trained using the Adam optimizer [16] $\left(\beta_{1}=0.5,\beta_{2}=0.999\right)$ with a base learning rate of $0.0001$. We linearly decay the learning rate after the first 100 epochs. We use simple data augmentation, flipping the images horizontally only. The input image size and the batch size are set to $128\times 128$ and 8, respectively, for all experiments. We update the discriminator five times for each generator (encoder-decoder) update. The hyper-parameters in Eq. 6 are set to $\lambda_{bi}=10$, $\lambda_{id}=10$ and $\lambda_{cls}=1$, and the gradient penalty weight in Eq. 1 is set to $\lambda_{gp}=10$. The whole model is implemented using PyTorch on a single NVIDIA GeForce GTX 1080.
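The training schedule described above can be sketched as two small helpers. The total number of epochs is not stated in the text, so `total_epochs=200` and a decay that reaches zero at the final epoch are assumptions; the function names are hypothetical as well.

```python
def learning_rate(epoch, base_lr=1e-4, decay_start=100, total_epochs=200):
    """Constant rate for the first epochs, then linear decay to zero."""
    if epoch < decay_start:
        return base_lr
    remaining = max(total_epochs - epoch, 0)
    return base_lr * remaining / (total_epochs - decay_start)


def update_schedule(num_iterations, n_critic=5):
    """Return the sequence of 'D' and 'G' updates for the given iterations."""
    steps = []
    for iteration in range(1, num_iterations + 1):
        steps.append("D")                # discriminator update every iteration
        if iteration % n_critic == 0:
            steps.append("G")            # generator update every fifth one
    return steps


steps = update_schedule(10)
```

Under these assumptions the rate stays at $10^{-4}$ through epoch 100, is halved by epoch 150, and ten iterations yield ten discriminator updates and two generator updates.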

IV-A Network Architecture

Tables I and II show the detailed network architectures of our proposed attribute-guided face image synthesis model. For the discriminator, we use PatchGAN [12], which penalizes structure at the scale of image patches. In the generator’s decoder, we use sub-pixel convolution instead of transposed convolution, followed by instance normalization [3]. Our experiments verify that it works remarkably better than transposed convolution for face image synthesis.
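The rearrangement at the core of sub-pixel convolution (often called pixel shuffle) can be illustrated without any framework. The sketch below assumes the common channel ordering in which input channel $c\,r^{2}+i\,r+j$ fills output offset $(i, j)$ of channel $c$; it covers only the rearrangement step and not the convolution that precedes it in the decoder.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r)."""
    channels_in, height, width = len(x), len(x[0]), len(x[0][0])
    assert channels_in % (r * r) == 0
    channels_out = channels_in // (r * r)
    out = [[[0.0] * (width * r) for _ in range(height * r)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for h in range(height):
            for w in range(width):
                for i in range(r):
                    for j in range(r):
                        # Each group of r*r channels fills one r x r block.
                        src = c * r * r + i * r + j
                        out[c][h * r + i][w * r + j] = x[src][h][w]
    return out


# Four 1x1 feature maps become a single 2x2 map.
x = [[[1.0]], [[2.0]], [[3.0]], [[4.0]]]
y = pixel_shuffle(x, r=2)
```

Because upsampling is a fixed reshuffle of learned channels, every output pixel is produced by the same number of filter taps, which is a commonly cited reason for preferring it over transposed convolution.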

V Experimental Results

V-A Datasets

Near IR Drivers’ Video Dataset: We introduce the Near IR dataset, which contains videos of emotion data captured from 26 subjects driving cars in a multi-camera setup. This dataset was collected to support drivers through Advanced Driver Assistance Systems (ADAS). The drivers show six basic facial expressions (anger, disgust, fear, happiness, sadness, and surprise) plus neutral faces. In our experiments, we use frames (peak expressions) of 20 subjects for training and validation, and 6 subjects for testing.

Oulu-CASIA VIS [39]: This dataset contains 480 sequences (from 80 subjects) of six basic facial expressions under visible (VIS) normal illumination conditions. We conducted our experiments using a subject-independent 10-fold cross-validation strategy.
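A subject-independent split assigns whole subjects, rather than individual images, to folds, so that no person appears in both training and test data. The sketch below shows one way to build such folds for 80 subjects; the shuffling seed and the helper names are assumptions, and the authors' exact fold assignment is not given in the text.

```python
import random


def subject_folds(subject_ids, k=10, seed=0):
    """Partition unique subject ids into k folds of near-equal size."""
    ids = sorted(set(subject_ids))
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]


def split_for_fold(folds, test_index):
    """Return (train_subjects, test_subjects) for one cross-validation round."""
    test = set(folds[test_index])
    train = {s for i, fold in enumerate(folds) if i != test_index for s in fold}
    return train, test


folds = subject_folds(range(80), k=10)
train_subjects, test_subjects = split_for_fold(folds, test_index=0)
```

Every sequence of a subject then follows that subject into the training or the test side of the round.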

MUG [1]: The MUG dataset contains image sequences of seven different facial expressions belonging to 86 subjects comprising 51 men and 35 women. The image sequences were captured with a resolution of $896\times 896$ . We used image sequences of 52 subjects and the corresponding annotation, which are available publicly via the internet.

BU-3DFE [36]: The Binghamton University 3D Facial Expression Database (BU-3DFE) [36] contains 3D models from 100 subjects, 56 females and 44 males. The subjects show a neutral face as well as six basic facial expressions at four different intensity levels. Following the setting in [33] and [37], we used an OpenGL-based tool from the database creators to render multiple views from the 3D models at seven pan angles $\left(0^{\circ},\pm 15^{\circ},\pm 30^{\circ},\pm 45^{\circ}\right)$.

RaFD [20]: The Radboud Faces Database (RaFD) contains 4,824 images belonging to 67 participants. Each subject makes eight facial expressions.

V-A1 Qualitative evaluation

From the qualitative results in Fig. 5, our facial attribute transfer test results (on images unseen during training) are more visually pleasing than those of the baselines, IcGAN [28] and CycleGAN [41]. We believe our proposed identity loss helps to preserve the face image details and identity. IcGAN even fails to generate subjects with the desired attributes, while our proposed method learns attribute-invariant features that can be used to synthesize multiple images with the desired attributes.

In addition, to evaluate the proposed realism refinement, the face attribute transfer results of our proposed method have been compared with the SimGAN method [31] on the BU-3DFE dataset [36] (see Fig. 6).

V-A2 User Study

We also evaluate the realism of our results with a user study comparing our model with CycleGAN [41]. We asked 15 subjects to select, through pairwise comparisons, the results that are more realistic and whose facial expression is more clearly distinguishable. A third choice, “None”, was provided for cases where neither method generated a realistic result. Sixteen random images with the corresponding emotion transfer results from the RaFD [20] dataset were presented in randomized order to each person. The pie chart in Fig. 7 illustrates that the results produced by our approach are more appealing to the users.

V-A3 Quantitative Evaluation

We quantitatively demonstrate the usefulness of our proposed model in synthesizing photo-realistic facial images controlled by the expression category. To do so, we augment real images from the Oulu-CASIA VIS dataset with synthetic expression images generated by our model and its variants to train an expression classifier, and compare with other methods. The purpose of this experiment is to introduce more variability and enrich the dataset in order to improve expression recognition performance. In particular, for each of the six expression categories, we generate 0.5K, 1K, 2K, 5K and 10K images, respectively. The classifier has the same network architecture as the one used in synthesizing RaFD [20] images, except for the number of neurons in the discriminator’s fully connected layer. The accuracy results for expression recognition are shown in Fig. 8. We observe that when the number of synthetic images is increased to 30K, the accuracy improves drastically, reaching 86.95%. The performance starts to saturate when more images (60K) are used. We achieve a higher recognition accuracy using the images generated by our method than state-of-the-art methods, including CNN-based methods such as DTAGN [13]. This suggests that our model has learned to generate more diverse realistic images. In addition, we evaluate the sensitivity of the results to different components of our proposed method (the bidirectional loss and the side conditional image, respectively).
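The relation between the per-category counts and the totals quoted above (30K and 60K) is simple arithmetic, shown here as a short check. Reading the 30K and 60K figures as six categories times 5K and 10K images is an interpretation of the text, not a statement from the authors.

```python
NUM_CATEGORIES = 6
per_category = [500, 1000, 2000, 5000, 10000]

# Total number of synthetic images for each augmentation level.
totals = [NUM_CATEGORIES * n for n in per_category]
print(totals)  # [3000, 6000, 12000, 30000, 60000]
```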

Moreover, we evaluate the performance of our proposed method on the MUG facial expression dataset [1], using the video frames of the peak expressions. In Table IV, we report the average expression classification accuracy on synthesized images. We trained a facial expression classifier based on ResNet-50 [10] with a $90\%/10\%$ split into training and test sets, obtaining a high accuracy of $90.42\%$. We then trained each of the baseline models, including CycleGAN and IcGAN, on the same training set and performed image-to-image translation on the same test set. Finally, we classified the expressions of the generated images using the above-mentioned classifier. As can be seen in Table IV, our model achieves the highest classification accuracy (close to that of real images), demonstrating that our model generates the most realistic expressions among all the methods compared.
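The protocol above amounts to scoring generated images with a fixed, pretrained classifier and comparing its predictions with the target expressions. A minimal sketch, assuming the classifier is any callable that maps an image to a label; `expression_accuracy` and the toy stand-ins are hypothetical and do not reproduce the ResNet-50 model or the MUG data.

```python
def expression_accuracy(classifier, images, target_labels):
    """Fraction of images whose predicted expression equals the target."""
    if len(images) != len(target_labels):
        raise ValueError("images and target_labels must have equal length")
    if not images:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        1 for image, target in zip(images, target_labels)
        if classifier(image) == target
    )
    return correct / len(images)


# Toy stand-ins so the sketch runs: each "image" is a single number and
# the "classifier" thresholds it. Neither reflects the paper's setup.
def toy_classifier(image):
    return "happy" if image > 0 else "sad"


toy_images = [0.9, -0.4, 0.2, -0.7]
toy_targets = ["happy", "sad", "sad", "sad"]
toy_accuracy = expression_accuracy(toy_classifier, toy_images, toy_targets)
```

Because the classifier is held fixed, differences in this score between generators reflect how recognizable the translated expressions are rather than differences in the classifier.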

For the near-IR drivers’ dataset, we conducted two sets of experiments. In the first experiment, we trained facial expression classifiers on subject-independent subsets (20 subjects for training and validation and 6 subjects for testing). We used a multi-view convolutional neural network (MVCNN) [32] as our baseline, with the VGG-Face model [27] as the bottleneck network. In the second experimental setup, we applied a face frontalization scheme and added realism to the simulated faces using our proposed approach and [31], respectively. As can be seen in Table IV, our model achieves the highest classification accuracy, demonstrating that our realism refinement helps the synthesized images preserve much of the facial expression detail.

Finally, as our last experiment, we performed 5-fold cross-validation using 100 subjects from the BU-3DFE dataset [36]. In each fold, the training data includes images of 80 subjects (non-frontal faces), while the test data includes images of the remaining 20 subjects with varying poses. We use the VGG-Face model [27], pretrained on RaFD [20], and further fine-tune it on the frontal face images from the BU-3DFE dataset. Table V shows that face frontalization improves expression recognition performance on profile faces (ranging from 15 to 45 degrees in 15-degree steps). Furthermore, adding realism to the synthetic images (simulated frontal faces) brings additional gains in expression recognition accuracy.
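A subject-wise 5-fold split of this kind, in which no subject appears in both the training and the test set of a fold, can be sketched as follows. The function `subject_folds`, the integer subject identifiers and the random seed are assumptions made for illustration; the paper does not specify how subjects were assigned to folds.

```python
import random


def subject_folds(subject_ids, k=5, seed=0):
    """Yield (train_subjects, test_subjects) pairs for k-fold CV.

    Splitting is done over subjects, not images, so every image of a
    given subject falls on the same side of the split.
    """
    subjects = sorted(set(subject_ids))
    rng = random.Random(seed)
    rng.shuffle(subjects)
    # Deal the shuffled subjects into k folds of near-equal size.
    folds = [subjects[i::k] for i in range(k)]
    for i in range(k):
        test = set(folds[i])
        train = set(subjects) - test
        yield sorted(train), sorted(test)


# 100 subjects and 5 folds give 80 training and 20 test subjects per fold.
splits = list(subject_folds(range(100)))
```

Each subject is used for testing exactly once across the five folds, and the reported accuracy would then be averaged over the folds.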

VI Conclusion and Future Work

In this paper, we propose an attribute-guided face image synthesis method that is capable of synthesizing photo-realistic face images conditioned on desired attributes. Using our proposed attribute classification objective and incorporating bidirectional learning, we demonstrate a proper way to model the latent representation across different domains, which leads to realistic face images. More importantly, we seek to reduce the domain distribution mismatch between synthetic and real faces. In addition, we demonstrate that the synthetic images generated by our method can be used for data augmentation to train a facial expression classifier. We achieve significantly higher average accuracy than the state-of-the-art result. In particular, the proposed method surpasses previous approaches by a significant margin of $5.5\%$ on the Oulu-CASIA VIS dataset. For future work, we plan to apply our model to translate dynamic textures of a face from a single image in the video domain.

Bibliography

[1] N. Aifanti, C. Papachristou, and A. Delopoulos. The MUG facial expression database. In Image analysis for multimedia interactive services (WIAMIS), 2010 11th international workshop on, pages 1–4. IEEE, 2010.
[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
[3] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[4] B. Bozorgtabar, M. S. Rad, H. K. Ekenel, and J.-P. Thiran. Learn to synthesize and synthesize to learn. Submitted to Journal of Computer Vision and Image Understanding (CVIU), Special Issue on Adversarial Learning in Computer Vision, 2018.
[5] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8789–8797, 2018.
[6] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, pages 658–666, 2016.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[8] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.