Identity-preserving Face Recovery from Stylized Portraits

Fatemeh Shiri; Xin Yu; Fatih Porikli; Richard Hartley; Piotr Koniusz

arXiv:1904.04241·cs.CV·April 10, 2019


TL;DR

This paper introduces IFRP, a novel method combining style removal and discriminative networks to recover photorealistic, identity-preserving faces from stylized portraits, including paintings and sketches.

Contribution

The paper proposes a new framework that effectively restores realistic faces from stylized images while maintaining identity, outperforming previous methods on various datasets.

Findings

01. Achieves state-of-the-art face recovery results.

02. Successfully handles unaligned stylized portraits and sketches.

03. Recovers high-quality, identity-preserving faces from diverse artistic styles.

Tables

Table 1: The number of training styles and the corresponding training times.

| Number of Training Styles | Training Time per Epoch | Seen Styles (SSIM) | Seen Styles (FSIM) | Unseen Styles (SSIM) | Unseen Styles (FSIM) |
|---|---|---|---|---|---|
| 1 Style | 1:49’ | 0.69 | 0.72 | 0.54 | 0.66 |
| 2 Styles | 3:54’ | 0.70 | 0.77 | 0.60 | 0.78 |
| 3 Styles | 5:20’ | 0.72 | 0.88 | 0.68 | 0.84 |
| 4 Styles | 7:05’ | 0.72 | 0.88 | 0.68 | 0.85 |
| 5 Styles | 9:47’ | 0.73 | 0.88 | 0.69 | 0.85 |

Table 3: Comparisons of PSNR, SSIM and FSIM on the entire test dataset.

| Methods | Seen Styles (PSNR) | Seen Styles (SSIM) | Seen Styles (FSIM) | Unseen Styles (PSNR) | Unseen Styles (SSIM) | Unseen Styles (FSIM) | Unseen Sketches (PSNR) | Unseen Sketches (SSIM) | Unseen Sketches (FSIM) |
|---|---|---|---|---|---|---|---|---|---|
| Gatys gatys2016image | 20.18 | 0.57 | 0.73 | 20.25 | 0.57 | 0.66 | 19.93 | 0.55 | 0.67 |
| Johnson johnson2016perceptual | 15.65 | 0.34 | 0.68 | 15.81 | 0.33 | 0.70 | 16.27 | 0.35 | 0.68 |
| MGAN li2016precomputed | 16.22 | 0.44 | 0.64 | 16.17 | 0.47 | 0.60 | 16.01 | 0.46 | 0.61 |
| pix2pix isola2016image | 20.82 | 0.59 | 0.80 | 18.90 | 0.54 | 0.67 | 19.01 | 0.55 | 0.66 |
| CycleGAN zhu2017unpaired | 18.58 | 0.32 | 0.69 | 15.89 | 0.27 | 0.64 | 15.65 | 0.31 | 0.65 |
| Shiri Shiri2017FaceD | 21.57 | 0.58 | 0.79 | 20.21 | 0.56 | 0.70 | 21.35 | 0.57 | 0.71 |
| IFRP | 26.08 | 0.72 | 0.88 | 24.83 | 0.68 | 0.84 | 24.89 | 0.68 | 0.83 |
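
The comparison table above reports PSNR in dB. As a reminder of what that number measures, the following is a minimal sketch of PSNR for two images given as flat lists of pixel intensities. The peak value of 255 (8-bit images) is an assumption, since the evaluation settings are not stated in this section.

```python
import math

def psnr(recovered, reference, peak=255.0):
    """Peak signal-to-noise ratio (dB) between two equally sized pixel lists."""
    if len(recovered) != len(reference) or not reference:
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(recovered, reference)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    # peak=255 is an assumed dynamic range, not a value taken from the paper.
    return 10.0 * math.log10(peak ** 2 / mse)
```

A higher PSNR means a lower mean squared error with respect to the ground truth image.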

Table 4: Comparisons of FRR and FCR on the entire test dataset.

| Methods | FRR (Seen Styles) | FRR (Unseen Styles) | FRR (Unseen Sketch) | FCR |
|---|---|---|---|---|
| Gatys gatys2016image | 64.67% | 62.28% | 68.36% | 72.89% |
| Johnson johnson2016perceptual | 50.54% | 38.87% | 40.27% | 44.99% |
| MGAN li2016precomputed | 26.97% | 22.52% | 24.99% | 38.24% |
| pix2pix isola2016image | 75.13% | 59.98% | 66.63% | 87.73% |
| CycleGAN zhu2017unpaired | 25.07% | 25.68% | 26.70% | 24.97% |
| Shiri Shiri2017FaceD | 84.51% | 75.32% | 76.44% | 89.09% |
| IFRP | 90.93% | 84.92% | 89.05% | 92.06% |

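Table 5: Quantitative comparisons of the impact of each of our losses.

| Loss Function | Seen Styles (SSIM) | Seen Styles (FSIM) | Unseen Styles (SSIM) | Unseen Styles (FSIM) |
|---|---|---|---|---|
| $\mathcal{L}_{\rm pix}$ | 0.60 | 0.72 | 0.54 | 0.65 |
| $\mathcal{L}_{\rm pix}$ + $\mathcal{L}_{\rm dis}$ | 0.62 | 0.75 | 0.58 | 0.72 |
| IFRP ( $\mathcal{L}_{\rm pix}$ + $\mathcal{L}_{\rm dis}$ + $\mathcal{L}_{\rm id}$ ) | 0.72 | 0.88 | 0.68 | 0.84 |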

Table 6: Quantitative comparisons of the impact of various IFRP network components.

| SRN Architecture | Seen Styles (SSIM) | Seen Styles (FSIM) | Unseen Styles (SSIM) | Unseen Styles (FSIM) |
|---|---|---|---|---|
| Standard Autoencoder | 0.65 | 0.84 | 0.62 | 0.80 |
| U-net Autoencoder | 0.65 | 0.87 | 0.61 | 0.78 |
| Top 2-layer skip conn. | 0.66 | 0.86 | 0.63 | 0.82 |
| IFRP: 2-layer skip conn.+Res. blocks | 0.72 | 0.88 | 0.68 | 0.84 |

Table 7: SSIM as a function of the number of in-plane rotation-based augmentations of SF images used during training.

| Rotation Angles (degrees) | Without STNs | With STNs |
|---|---|---|
| -30, -20, -15, -10, -5, 0, 5, 10, 15, 20, 30 | 0.64 | 0.66 |
| -30, -15, 0, 15, 30 | 0.64 | 0.65 |

Table 8: The STN1 architecture.

STN1
Input: 64 x 64 x 32
3 x 3 x 64 conv, relu, Max-pooling(2,2)
3 x 3 x 128 conv, relu, Max-pooling(2,2)
3 x 3 x 256 conv, relu, Max-pooling(2,2)
3 x 3 x 20 conv, relu, Max-pooling(2,2)
3 x 3 x 20 conv, relu
fully connected (80,20), relu
fully connected (20,4)

Table 9: The STN2 architecture.

STN2
Input: 32 x 32 x 64
3 x 3 x 128 conv, relu, Max-pooling(2,2)
3 x 3 x 256 conv, relu, Max-pooling(2,2)
3 x 3 x 20 conv, relu, Max-pooling(2,2)
3 x 3 x 20 conv, relu
fully connected (80,20), relu
fully connected (20,4)

Table 10: The STN3 architecture.

STN3
Input: 16 x 16 x 128
3 x 3 x 256 conv, relu, Max-pooling(2,2)
3 x 3 x 20 conv, relu, Max-pooling(2,2)
3 x 3 x 20 conv, relu
fully connected (80,20), relu
fully connected (20,4)

Table 11: The STN4 architecture.

STN4
Input: 32 x 32 x 64
3 x 3 x 64 conv, relu, Max-pooling(2,2)
3 x 3 x 128 conv, relu, Max-pooling(2,2)
3 x 3 x 256 conv, relu, Max-pooling(2,2)
3 x 3 x 20 conv, relu
fully connected (80,20), relu
fully connected (20,4)
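
All four STN variants end in a fully connected layer with 80 inputs, which gives a simple consistency check on the listed layers. The sketch below traces the spatial size through each STN under an assumption that is not stated in the tables: the convolutions followed by max-pooling are zero-padded to preserve size, and the final 3 x 3 convolution is unpadded. The names `trace_stn`, `STN_LAYERS` and `flattened_features` are illustrative only.

```python
def trace_stn(input_size, layers):
    """Return the spatial size after each layer.

    Each layer is (padding, pooled): padding is "same" or "valid" for a
    3x3 stride-1 convolution; pooled says whether 2x2 max-pooling follows.
    """
    size = input_size
    sizes = []
    for padding, pooled in layers:
        if padding == "valid":
            size -= 2      # a 3x3 unpadded convolution trims one pixel per side
        if pooled:
            size //= 2     # 2x2 max-pooling with stride 2 halves the size
        sizes.append(size)
    return sizes

# Assumed padding scheme (not given in the tables): "same" for every
# convolution followed by pooling, "valid" for the last convolution.
STN_LAYERS = {
    "STN1": (64, [("same", True)] * 4 + [("valid", False)]),
    "STN2": (32, [("same", True)] * 3 + [("valid", False)]),
    "STN3": (16, [("same", True)] * 2 + [("valid", False)]),
    "STN4": (32, [("same", True)] * 3 + [("valid", False)]),
}

def flattened_features(name, channels=20):
    """Number of inputs seen by the first fully connected layer."""
    size, layers = STN_LAYERS[name]
    final = trace_stn(size, layers)[-1]
    return final * final * channels
```

Under this assumption each STN reaches a 2 x 2 x 20 feature map before flattening, which matches the 80 inputs of the `fully connected (80,20)` layer in every table.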

Equations

$$\mathcal{L}_{pix}(\Theta) = \mathbb{E}_{(I_{s}, I_{r}) \sim p(I_{s}, I_{r})} \| G_{\Theta}(I_{s}) - I_{r} \|_{F}^{2}$$

$$\mathcal{L}_{id}(\Theta) = \mathbb{E}_{(I_{s}, I_{r}) \sim p(I_{s}, I_{r})} \| \psi(G_{\Theta}(I_{s})) - \psi(I_{r}) \|_{F}^{2}$$

$$\mathcal{L}_{dis}(\Phi) = - \mathbb{E}_{I_{r} \sim p(I_{r})} [\log D_{\Phi}(I_{r})] - \mathbb{E}_{\widehat{I}_{r} \sim p(\widehat{I}_{r})} [\log (1 - D_{\Phi}(\widehat{I}_{r}))]$$

$$\mathcal{L}_{SRN}(\Theta) = \mathcal{L}_{pix} + \lambda \mathcal{L}_{dis} + \eta \mathcal{L}_{id} = \mathbb{E}_{(I_{s}, I_{r}) \sim p(I_{s}, I_{r})} \| G_{\Theta}(I_{s}) - I_{r} \|_{F}^{2} + \lambda \, \mathbb{E}_{I_{s} \sim p(I_{s})} [\log D_{\Phi}(G_{\Theta}(I_{s}))] + \eta \, \mathbb{E}_{(I_{s}, I_{r}) \sim p(I_{s}, I_{r})} \| \psi(G_{\Theta}(I_{s})) - \psi(I_{r}) \|_{F}^{2}$$

Full text

Fatemeh Shiri (1), Xin Yu (1), Fatih Porikli (1), Richard Hartley (1,2), Piotr Koniusz (2,1)

(1) Australian National University

(2) Data61/CSIRO

(Received: 23.02.2018 / Accepted: 29.01.2019)

Abstract

Given an artistic portrait, recovering the latent photorealistic face that preserves the subject’s identity is challenging because the facial details are often distorted or fully lost in artistic portraits. We develop an Identity-preserving Face Recovery from Portraits (IFRP) method that utilizes a Style Removal network (SRN) and a Discriminative Network (DN). Our SRN, composed of an autoencoder with residual block-embedded skip connections, is designed to transfer feature maps of stylized images to the feature maps of the corresponding photorealistic faces. Owing to the Spatial Transformer Network (STN), SRN automatically compensates for misalignments of stylized portraits to output aligned realistic face images. To ensure the identity preservation, we promote the recovered and ground truth faces to share similar visual features via a distance measure which compares features of recovered and ground truth faces extracted from a pre-trained FaceNet network. DN has multiple convolutional and fully-connected layers, and its role is to enforce recovered faces to be similar to authentic faces. Thus, we can recover high-quality photorealistic faces from unaligned portraits while preserving the identity of the face in an image. By conducting extensive evaluations on a large-scale synthesized dataset and a hand-drawn sketch dataset, we demonstrate that our method achieves superior face recovery and attains state-of-the-art results. In addition, our method can recover photorealistic faces from unseen stylized portraits, artistic paintings, and hand-drawn sketches.

Keywords: Face Synthesis, Image Stylization, Face Recovery, Destylization, Generative Models

1 Introduction

Style transfer methods are powerful tools that can generate portraits in various artistic styles from photorealistic images. Unlike prior research on image stylization, we address the challenging inverse problem of recovering a photorealistic face image from a given stylized portrait. Latent photorealistic face images recovered from artistic portraits are interpretable for humans and may be useful in facial analysis. Since facial details and expressions in stylized portraits often undergo severe distortions and become corrupted by artifacts such as profile edges and color changes, e.g., as in Figure 1(b), recovering a photorealistic face image from its stylized counterpart is very challenging. In general, stylized face images contain a variety of facial expressions, facial feature distortions and misalignments. Therefore, landmark detectors often fail to localize facial landmarks accurately, as shown in Figure 1(c).

While recovering photorealistic images from portraits is still uncommon in the literature, image stylization methods have been widely studied. With the use of Convolutional Neural Networks (CNN), Gatys et al. gatys2016controlling achieve promising results by transferring different styles of artworks to images via the semantic contents space. Since their method generates the stylized images by iteratively updating the feature maps of CNNs, it is computationally costly. In order to reduce the computational complexity, several feed-forward CNN-based methods have been proposed ulyanov2016texture ; ulyanov2016instance ; johnson2016perceptual ; dumoulin2016 ; li2017diversified ; chen2016fast ; zhang2017multi ; huang2017arbitrary . However, these methods work only with a single style applied during training. Moreover, such methods are insufficient for generating photorealistic face images as they only capture the correlations of feature maps via Gram matrices thus discarding spatial relations pk_tensor ; me_museum ; power_look_cvpr .

In order to capture spatial/localized statistics of a style image, several patch-based methods li2016precomputed ; isola2016image have been developed. However, such methods cannot capture the global appearance of faces, thus failing to generate authentic face images. For instance, patch-based methods li2016precomputed ; isola2016image fail to attain consistency of face colors, as shown in Figure LABEL:fig:cmp2e. Moreover, the state-of-the-art style transfer methods gatys2016controlling ; li2016precomputed ; ulyanov2016texture ; johnson2016perceptual transfer desired styles to images without considering the task of identity preservation. Thus, these methods cannot generate realistic-looking faces with preserved identities.

Our first face destylization architecture Shiri2017FaceD uses only a pixel-wise loss in the generative part of the network. Despite being trained on a large-scale dataset, this method fails to recover faces from unaligned portraits under a variety of scales, rotations and viewpoint variations. This journal manuscript is an extension of our second model Shiri2018wacv , which introduces the identity-preserving loss into destylization. Our latest model Shiri2019wacv performs identity-preserving face destylization with the use of attributes, which allow the manipulation of appearance details such as hair color, facial expressions, etc.

In this paper, we develop a novel end-to-end trainable identity-preserving approach to face recovery that automatically maps unaligned stylized portraits to aligned photorealistic face images. Our network employs two subnetworks: a generative subnetwork, dubbed Style Removal Network (SRN), and a Discriminative Network (DN). SRN consists of an autoencoder (a downsampling encoder and an upsampling decoder) and Spatial Transformer Networks (STN) jaderberg2015spatial . The encoder extracts facial components from unaligned stylized face images to transfer the extracted feature maps to the domain of photorealistic images. Subsequently, our decoder forms face images. STN layers are used by the encoder and decoder to align stylized faces. Since faces may appear at different orientations, scales and in various poses, the network may not fully capture all this variability if the training data does not account for it. As a result, we would need heavy data augmentation and more training instances with a variety of poses in the training dataset to cope with the recovery of faces from authentic portraits that may be presented at an angle, from a different viewpoint, etc. In contrast to such costly training, by exploiting STN layers, we require less data to train our network to cope well with images containing face rotations, translations and scale changes. Nonetheless, with or without STN layers, we expose our network during training to images of faces at different scales and rotations to teach it how to recover the frontal view. We aim to recover faces in frontal view for visualization purposes (frontal views are easy for humans to interpret, face retrieval software works better with frontal views, etc.). The discriminative network, inspired by approaches Goodfellow2014 ; denton2015deep ; yu2016ultra ; yu2017face , forces SRN to generate destylized faces that are similar to authentic ground truth faces.

As we aim to preserve the information about facial identities, we force the CNN feature representations of recovered faces to be as close to the features of ground truth real faces as possible. For this purpose, we employ pixel-level Euclidean and identity-preserving losses. We also use an adversarial loss to achieve high-quality visual results.

To train our network, pairs of Stylized Face (SF) and ground truth Real Face (RF) images are required. Thus, we synthesize a large-scale dataset of SF/RF pairs. As there exist numerous styles to choose from, we cannot generate faces in all possible styles for training. We note that a Gram matrix formed from the features of a pre-trained VGG network can capture style details of input images gatys2016image . Thus, we measure the similarity of various styles via the Log-Euclidean distance jayasumana2013kernel between the Gram matrices of style images and the average Gram matrix of real faces. Based on this style-distance metric, we select three distinct styles for training.
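
The style-distance computation described above can be sketched as follows, assuming NumPy is available. This is an illustration under stated assumptions, not the authors' code: the feature map would come from a pre-trained VGG layer (not loaded here), the normalization by the number of spatial positions and the `eps` regularization that keeps the matrix logarithm defined are both choices made for this sketch, and all function names are hypothetical.

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a (channels, height, width) feature map."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    # Normalising by the number of positions is an assumption of this sketch.
    return flat @ flat.T / (h * w)

def spd_log(matrix, eps=1e-6):
    """Matrix logarithm of a symmetric PSD matrix, regularised by eps * I."""
    sym = 0.5 * (matrix + matrix.T) + eps * np.eye(matrix.shape[0])
    eigvals, eigvecs = np.linalg.eigh(sym)
    # Clamp eigenvalues so the logarithm stays finite for rank-deficient input.
    log_vals = np.log(np.maximum(eigvals, max(eps, 1e-300)))
    return (eigvecs * log_vals) @ eigvecs.T

def log_euclidean_distance(a, b, eps=1e-6):
    """Frobenius norm of the difference between the two matrix logarithms."""
    return float(np.linalg.norm(spd_log(a, eps) - spd_log(b, eps), ord="fro"))
```

Given the Gram matrix of a style image and the average Gram matrix of real faces, `log_euclidean_distance` returns a single scalar per style, which is the kind of quantity the style selection relies on.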

Moreover, we have observed that CNN filters learned on images of seen styles (used for training) tend to extract meaningful features from images in both seen and unseen styles. Thus, our method can also extract facial information from unseen stylized portraits and generate photorealistic faces, as demonstrated in the experimental section.

Below we list our contributions:

I. We design a new framework to automatically remove styles from unaligned stylized portraits. Our approach generates facial identities and expressions that match the ground truth face images well (identity preservation).

II. We propose an autoencoder with skip connections between top convolutional and deconvolutional layers, each skip connection being composed of three residual blocks. These skip connections pass high-level visual features of portraits from convolutional to deconvolutional layers, which leads to an improved restoration performance.

III. We add an identity-preserving loss to remove seen/unseen styles from portraits and preserve underlying identities.

IV. We use STNs as intermediate layers to learn to align non-aligned input portraits. Thus, our method does not use any facial landmarks or 3D models of faces (typically used for face alignment) and requires somewhat fewer augmentations than a network without STNs.

V. We propose a style-distance metric to capture the most distinct styles for training. Thus, our network achieves a good generalization when tested on unseen styles.

Our large dataset of pairs of stylized and photorealistic faces, and the code will be available on https://github.com/fatimashiri and/or http://claret.wikidot.com.

2 Related Work

In this section, we briefly review neural generative models and deep style transfer methods for image generation.

2.1 Neural Generative Models

There exist many generative models for the problem of image generation oord2016pixel ; kingma2013auto ; oord2016pixel ; Goodfellow2014 ; denton2015deep ; zhang2017image ; Shiri2017FaceD . Among them, GANs are conceptually closely related to our problem as they employ an adversarial loss that forces the generated images to be as photorealistic as the ground truth images.

Several methods for super-resolution ledig2016photo ; yu2017face ; huang2017beyond ; yu2017hallucinating ; yu2016ultra and inpainting pathak2016context adopt an adversarial training to learn a parametric translating function from a large-scale dataset of input-output pairs. These approaches often use the $\ell_{1}$ or $\ell_{2}$ norm and adversarial losses to compare the generated image to the corresponding ground truth image. Although these methods produce impressive photorealistic images, they fail to preserve identities of subjects.

Conditional GANs have been used for the task of generating photographs from semantic layout/scene attributes karacan2016learning and sketches sangkloy2016scribbler . Li and Wand li2016precomputed train a Markovian GAN for style transfer: discriminative training is applied on Markovian neural patches to capture local style statistics. Isola et al. isola2016image develop the “pix2pix” framework, which uses a so-called “U-Net” architecture and a patch-GAN to transfer low-level features from the input to the output domain. For faces, this approach produces visual artifacts and fails to capture the global appearance of faces.

Patch-based methods fail to capture the global appearance of faces and, as a result, they generate poorly destylized images. In contrast, we propose an identity-preserving loss to faithfully recover the most prominent details of faces.

Moreover, there exist several deep learning methods that synthesize sketches from photographs (and vice versa) nejati2011study ; wang2018back ; wang2018high ; sharma2011bypassing . Wang et al. wang2018back use the vanilla conditional GAN (cGAN) to generate sketches. However, the cGAN produces sketch-like artifacts in the synthesized faces as well as facial deformations. Wang et al. wang2018high use the CycleGAN CycleGAN2017 , and employ multi-scale discriminators to generate high resolution sketches/photos. Their method demonstrates a greatly improved performance. However, it still produces slight blur and/or color degraded artifacts. Kazemi et al. kazemi2018facial employ Cycle-GAN conditioned on facial attributes in order to enforce desired facial attributes over the images synthesized from sketches. While sketch-to-face synthesis is a related problem, our unified framework works well with a variety of styles more complex than sketches.

2.2 Deep Style Transfer

Style transfer is a technique which can render a given content image (input) according to a specific painting style while preserving the visual contents of the input. We distinguish image optimization and feed-forward style transfer methods. The seminal optimization-based work gatys2016image transfers the style of an artistic image to a given photograph. It uses iterative optimization to generate a target image from a random initialization (following the Normal distribution). During the optimization step, the statistics of the feature maps of the target, the content and style images are matched.

Gatys et al. gatys2016image inspired many follow-up studies. Yin yin2016content presents a content-aware style transfer method which initializes the optimization step with a content image instead of a random noise. Li and Wand li2016combining propose a patch-based style transfer method which combines Markov Random Field (MRF) and CNN techniques. Gatys et al. gatys2016preserving transfer the style via linear models and preserve colors of content images by matching color histograms.

Gatys et al. gatys2016controlling decompose styles into perceptual factors and then manipulate them for the style transfer. Selim et al. selim2016painting modify the content loss through a gain map for the transfer of head portrait paintings. Wilmot et al. wilmot2017stable use histogram-based losses in their objective and build on the algorithm of Gatys et al. gatys2016image . Although the above optimization-based methods further improve the quality of style transfer, they are computationally expensive due to the iterative optimization procedure, which limits their practical use.

To address the poor computational speed, feed-forward methods replace the original on-line iterative optimization step with training a feed-forward neural network off-line and generating stylized images on-line ulyanov2016texture ; johnson2016perceptual ; li2016precomputed .

Johnson et al. johnson2016perceptual train a generative network for a fast style transfer using perceptual loss functions. The architecture of their generator network follows the work of radford2015unsupervised and also uses residual blocks. Texture Network ulyanov2016texture employs a multi-resolution architecture in the generator network. Ulyanov et al. ulyanov2016instance ; ulyanov2017improved replace the spatial batch normalization with the instance normalization to achieve a faster convergence. Wang et al. wang2016multimodal enhance the granularity of the feed-forward style transfer with a multimodal CNN, which performs stylization hierarchically using multiple losses deployed across multiple scales.

These feed-forward methods perform stylization around 1000 $\times$ faster than the optimization-based methods. However, they cannot adapt to arbitrary styles not used during training. In order to synthesize an image according to a new style, the entire network needs retraining. To deal with such a restriction, a number of recent approaches encode multiple styles within a single feed-forward network dumoulin2016 ; chen2016fast ; chen2017stylebank ; li2017diversified .

Dumoulin et al. dumoulin2016 use a so-called conditional instance normalization that learns normalization parameters for each style. Given feature maps of the content and style images, method chen2016fast replaces content features with the closest matching style features patch-by-patch. Chen et al. chen2017stylebank present a network that learns a set of new filters for every new style. Li et al. li2017diversified propose a texture controller which forces the network to synthesize the desired style. We note that the existing feed-forward approaches have to compromise between the generalization li2017diversified ; huang2017arbitrary ; zhang2017multi and quality ulyanov2017improved ; ulyanov2016instance ; gupta2017characterizing .

3 Proposed Method

Below we present an identity-preserving framework that infers a photorealistic face image ${\widehat{\boldsymbol{I}}}_{r}$ from an unaligned stylized face image ${\boldsymbol{I}}_{s}$ .

3.1 Network Architecture

Our network consists of two parts: a Style Removal Network (SRN) and a Discriminative Network (DN). SRN is composed of an autoencoder as well as skip connections with residual blocks. The SRN module extracts residual feature maps from input portraits and then upsamples them. To attain high-quality visual performance, we pass visual information from the last few layers of the encoder to the corresponding layers of the decoder. The role of DN is to promote the recovered face images to be similar to their real counterparts. The general architecture of our IFRP framework is depicted in Figure 2.

**Style Removal Network.** As the goal of face recovery is to generate a photorealistic destylized image, a generative network should be able to remove various styles of portraits without losing the identity information. To this end, we propose the SRN block which employs a fully convolutional autoencoder (a downsampling encoder and an upsampling decoder) with skip connections and STN layers. Figure 2 shows the architecture of our SRN block (the blue frame).

The autoencoder learns a deterministic mapping to transform images from the space of portraits into some latent space (via an encoder), and a mapping from the latent space to the space of real faces (via a decoder). In this manner, the encoder extracts high-level features of unaligned stylized faces and transforms them into feature vectors of some latent real face domain, while the decoder synthesizes photorealistic faces from these feature vectors.

Moreover, we symmetrically link convolutional and deconvolutional layers via skip-layer connections long2015fully . These skip connections pass high-resolution visual details of portraits from convolutional to deconvolutional layers, leading to a good quality recovery. In detail, each skip connection comprises three residual blocks. Due to the usage of residual blocks, our network can remove the styles of input portraits and increase the visual quality as shown in Figure 4. In contrast, the same network but without skip connections tends to produce blurry/fuzzy face images as shown in Figure 4. Figure 4 shows that the visual quality improves as components of our architecture are enabled one-by-one.
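
The structure of one such skip connection can be illustrated with a toy sketch on flat lists of activations. It only shows the data flow of three chained residual blocks feeding a decoder layer. The `transform` callables stand in for the learned convolutional layers inside each block, and merging by element-wise sum is an assumption of this sketch (the text does not say whether features are summed or concatenated).

```python
def residual_block(x, transform):
    """y = x + transform(x), applied element-wise to a list of activations."""
    return [xi + ti for xi, ti in zip(x, transform(x))]

def skip_connection(encoder_features, transforms):
    """Pass encoder features through a chain of residual blocks."""
    out = encoder_features
    for transform in transforms:
        out = residual_block(out, transform)
    return out

def merge(decoder_features, skipped):
    """Combine decoder activations with the skip path (element-wise sum assumed)."""
    return [d + s for d, s in zip(decoder_features, skipped)]
```

With three transforms in `transforms`, `skip_connection` mirrors the three residual blocks per skip connection, and a transform that returns zeros reduces each block to the identity mapping.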

As input stylized faces are often misaligned due to in-plane rotations, translations and scale changes, we incorporate Spatial Transformer Networks (STNs) jaderberg2015spatial (green blocks in Figure 2) into the SRN. The STN layer can estimate the motion parameters of face images and warp them to the so-called canonical view. Thus, our method does not require the use of facial landmarks or 3D face models (often used for face alignment). Figure 4 shows that these intermediate STN layers help compensate for misalignment of the input portraits (however, their use is optional). The architecture of our STN layers is given in Appendix A.
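
Each STN in the architecture tables ends in a layer with four outputs. One plausible reading, and it is only an assumption here, is that these four values parameterize a similarity transform: a uniform scale, an in-plane rotation, and a 2D translation. The sketch below builds the corresponding 2 x 3 matrix and applies it to a point in normalized coordinates. The parameter order and function names are hypothetical.

```python
import math

def similarity_matrix(scale, angle, tx, ty):
    """2x3 matrix for rotation by `angle` (radians), uniform scale, translation."""
    c, s = scale * math.cos(angle), scale * math.sin(angle)
    return [[c, -s, tx],
            [s,  c, ty]]

def warp_point(matrix, x, y):
    """Apply the 2x3 matrix to a point given in normalised image coordinates."""
    (a, b, tx), (c, d, ty) = matrix
    return (a * x + b * y + tx, c * x + d * y + ty)
```

In a full spatial transformer, a matrix like this would be applied to every location of a regular output grid to obtain sampling positions in the input feature map, followed by bilinear interpolation.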

For appearance similarity between the recovered faces and their RF ground truth counterparts, we exploit a pixel-wise $\ell_{2}$ loss and an identity-preserving loss. The pixel-wise $\ell_{2}$ loss enforces intensity-based similarity between images of recovered faces and their ground truth images. The autoencoder supervised by the $\ell_{2}$ loss alone tends to produce over-smoothed results, as shown in Figure 3. For the identity-preserving loss, we use FaceNet schroff2015facenet to extract features from images (see Section 3.2 for more details), and then we compute the Euclidean distance between the feature maps of the two images. In this way, we encourage the feature similarity between recovered faces and their ground truth counterparts. Without the identity-preserving loss, the network produces random artifacts that resemble facial details, such as wrinkles, as shown in Figure 3.

**Discriminative Network.** Using only the pixel-wise distance between the recovered faces and their ground truth real counterparts leads to over-smoothed results, as shown in Figure 3. To obtain appealing visual results, we introduce a discriminator, which forces recovered faces to reside in the same latent space as real faces. Our proposed DN is composed of convolutional layers and fully connected layers, as illustrated in Figure 2 (the red frame). The discriminative loss, also known as the adversarial loss, penalizes the discrepancy between the distributions of recovered and real faces. This loss is also used to update the parameters of the SRN block (we alternate between updates of the parameters of SRN and DN). Figure 3 shows the impact of the adversarial loss on the final results.

**Identity Preservation.** With the adversarial loss, the SRN is able to generate high-frequency facial content. However, the results often lack details of identities such as the beard or wrinkles, as illustrated in Figure 3. A possible way to address this issue is to constrain the recovered face images and the ground truth face images to share the same face-related visual features, e.g., FaceNet features schroff2015facenet .

3.2 Training Details

To train our IFRP network in an end-to-end fashion, we require a large number of SF/RF training image pairs. For each RF, we synthesize different unaligned SF images according to chosen artistic styles to obtain SF/RF training pairs $({\boldsymbol{I}}_{s},{\boldsymbol{I}}_{r})$ . As described in Section 4, we only use stylized faces from three distinct styles in the training stage.

Motivated by the ideas of Gatys et al. gatys2016image and Johnson et al. johnson2016perceptual , we construct a so-called identity-preserving loss. Specifically, we compute the Euclidean distance between the feature maps of the recovered and ground truth images. These feature maps are obtained from the ReLU activations of FaceNet schroff2015facenet .

Our previous work Shiri2017FaceD uses only the Euclidean loss to compare the generated and ground truth images which results in blurry images. In this work, we use the FaceNet network for the identity preservation loss and compare FaceNet to VGG-19 which is pre-trained on the large-scale ImageNet dataset containing objects. In contrast, FaceNet is pre-trained on a large dataset of 200 million face identities and 800 million pairs of face images. Therefore, FaceNet can capture visually meaningful facial features. As shown in Figure 5, with the help of FaceNet, our results achieve higher fidelity and better consistency with respect to the ground truth face images. Figure 5 shows the results for VGG-19.

With FaceNet, we can preserve the identity information by encouraging the feature similarity between the generated and ground truth faces. We combine the pixel-wise loss, the adversarial loss and the identity-preserving loss together as our final loss function to train our network. Figure 3 illustrates that, with the help of the identity-preserving loss, our IFRP network can recover satisfying identity-preserving images. Below we explain each loss individually.

**Pixel-wise Intensity Similarity Loss.** Our goal is to train our feed-forward SRN to produce an aligned photorealistic face image from any given stylized unaligned portrait. To achieve this, we force the recovered face image ${\widehat{\boldsymbol{I}}}_{r}$ to be similar to its ground truth counterpart ${\boldsymbol{I}}_{r}$ . We denote the output of our SRN as ${\boldsymbol{G}}_{\boldsymbol{\Theta}}({\boldsymbol{I}}_{s})$ . Since the STN layers are interwoven with the layers of our autoencoder, we optimize the parameters of the autoencoder and the STN layers simultaneously. The pixel-wise loss function $\mathcal{L}_{\small{\rm pix}}$ between ${\widehat{\boldsymbol{I}}}_{r}$ and ${\boldsymbol{I}}_{r}$ is expressed as:

$$\mathcal{L}_{\rm pix}({\boldsymbol{\Theta}}) = \mathbb{E}_{({\boldsymbol{I}}_{s},{\boldsymbol{I}}_{r})\sim p({\boldsymbol{I}}_{s},{\boldsymbol{I}}_{r})} \left\| {\boldsymbol{G}}_{\boldsymbol{\Theta}}({\boldsymbol{I}}_{s}) - {\boldsymbol{I}}_{r} \right\|_{F}^{2},$$

where $p({\boldsymbol{I}}_{s},{\boldsymbol{I}}_{r})$ represents the joint distribution of the SF and RF images in the training dataset, and ${\boldsymbol{\Theta}}$ denotes the parameters of the SRN block.
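
In practice the expectation in the pixel-wise loss is estimated on a mini-batch of SF/RF pairs. A minimal sketch of that estimate, with images stored as nested lists and the SRN output assumed to be already computed, might look like this (the function names are illustrative):

```python
def squared_frobenius(a, b):
    """Squared Frobenius norm of a - b for images stored as nested lists."""
    return sum((x - y) ** 2
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))

def pixel_loss(recovered_batch, real_batch):
    """Mini-batch estimate of L_pix: mean squared Frobenius distance over pairs."""
    pairs = list(zip(recovered_batch, real_batch))
    return sum(squared_frobenius(g, r) for g, r in pairs) / len(pairs)
```

Here `recovered_batch` plays the role of the SRN outputs for a batch of stylized inputs and `real_batch` holds the matching ground truth faces.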

**Identity-preserving Loss.** To obtain convincing identity-preserving results, we propose an identity-preserving loss that takes the form of the Euclidean distance between the features of the recovered face image ${\widehat{\boldsymbol{I}}}_{r}={\boldsymbol{G}}_{{\boldsymbol{\Theta}}}({\boldsymbol{I}}_{s})$ and those of the ground truth face image ${\boldsymbol{I}}_{r}$. The identity-preserving loss $\mathcal{L}_{\rm id}$ is given as:

[TABLE]

where ${\boldsymbol{\psi}}(\cdot)$ denotes the feature maps extracted from the ReLU3-2 layer of the FaceNet model for a given input image.
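The same idea applied in feature space can be sketched as follows. This is only an illustration: `toy_features` is a hypothetical stand-in for ${\boldsymbol{\psi}}(\cdot)$ (2x2 average pooling followed by a ReLU), not FaceNet, and the squared Euclidean distance averaged over the batch is an assumed form of the loss.

```python
import numpy as np

def toy_features(images):
    """Hypothetical stand-in for psi: 2x2 average pooling, then ReLU."""
    b, h, w, c = images.shape
    pooled = images.reshape(b, h // 2, 2, w // 2, 2, c).mean(axis=(2, 4))
    return np.maximum(pooled, 0.0)

def identity_loss(feature_fn, recovered, target):
    """Batch mean of the squared Euclidean distance between feature maps."""
    f_rec = feature_fn(recovered)
    f_gt = feature_fn(target)
    diff = (f_rec - f_gt).reshape(f_rec.shape[0], -1)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

# Toy usage: in practice feature_fn would be a frozen, pre-trained network.
rng = np.random.default_rng(0)
recovered = rng.random((4, 8, 8, 3))
ground_truth = rng.random((4, 8, 8, 3))
loss_value = identity_loss(toy_features, recovered, ground_truth)
```

Passing the feature extractor as an argument keeps the loss independent of the network chosen, which is convenient when comparing extractors such as FaceNet and VGG-19.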

**Discriminative Loss.** Motivated by the idea of Goodfellow2014; denton2015deep; radford2015unsupervised, we aim to make the discriminative network ${\boldsymbol{D}}_{\boldsymbol{\Phi}}$ fail to distinguish recovered face images from ground truth face images. Therefore, the parameters of the discriminator ${\boldsymbol{\Phi}}$ are updated by minimizing $\mathcal{L}_{\rm dis}$, expressed as:

[TABLE]

where $p({\boldsymbol{I}}_{r})$ and $p({\widehat{\boldsymbol{I}}}_{r})$ indicate the distributions of real and recovered face images, respectively, and ${\boldsymbol{D}}_{\boldsymbol{\Phi}}({\boldsymbol{I}}_{r})$ and ${\boldsymbol{D}}_{\boldsymbol{\Phi}}({\widehat{\boldsymbol{I}}}_{r})$ are the outputs of ${\boldsymbol{D}}_{\boldsymbol{\Phi}}$ for real and recovered face images. The $\mathcal{L}_{\rm dis}$ loss is also backpropagated with respect to the parameters ${\boldsymbol{\Theta}}$ of the SRN block.
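A minimal sketch of a discriminator objective of this kind is given below. It assumes the standard binary cross-entropy form used in adversarial training, where ${\boldsymbol{D}}_{\boldsymbol{\Phi}}$ outputs the probability that its input is a real face. The function name and the clipping constant `eps` are illustrative additions for numerical safety.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Cross-entropy loss minimized with respect to the discriminator.

    d_real: D outputs in (0, 1) for ground truth faces.
    d_fake: D outputs in (0, 1) for recovered faces.
    Assumption: standard GAN form, -E[log D(real)] - E[log(1 - D(fake))].
    """
    d_real = np.clip(d_real, eps, 1.0 - eps)  # avoid log(0)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return float(-np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake)))

# An undecided discriminator (0.5 everywhere) versus a confident one.
undecided = discriminator_loss(np.full(8, 0.5), np.full(8, 0.5))
confident = discriminator_loss(np.full(8, 0.99), np.full(8, 0.01))
```

Under this assumed form the loss falls as the discriminator separates the two sets better, so the SRN improves its own objective by pushing the discriminator's outputs on recovered faces towards those on real faces.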

Our SRN loss is a weighted sum of three terms: the pixel-wise loss, the adversarial loss, and the identity-preserving loss. The parameters ${\boldsymbol{\Theta}}$ are obtained by minimizing the final objective function of the SRN loss given below:

[TABLE]

where $\lambda$ and $\eta$ are trade-off parameters for the discriminator and the identity-preserving losses, respectively, and $p({\boldsymbol{I}}_{s})$ is the distribution of stylized face images.
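As a reading aid, the weighted-sum structure described here can be written compactly. This is a sketch, not necessarily the exact equation of the paper: $\mathcal{L}_{\rm adv}$ is shorthand introduced here for the adversarial term seen by the SRN, and the form shown for it (the SRN trying to make ${\boldsymbol{D}}_{\boldsymbol{\Phi}}$ label recovered faces as real) is an assumed, commonly used choice.

$$
\mathcal{L}_{\rm SRN}({\boldsymbol{\Theta}}) = \mathcal{L}_{\rm pix}({\boldsymbol{\Theta}}) + \lambda\,\mathcal{L}_{\rm adv}({\boldsymbol{\Theta}}) + \eta\,\mathcal{L}_{\rm id}({\boldsymbol{\Theta}}),
\qquad
\mathcal{L}_{\rm adv}({\boldsymbol{\Theta}}) = -\,\mathbb{E}_{{\boldsymbol{I}}_{s}\sim p({\boldsymbol{I}}_{s})}\Big[\log {\boldsymbol{D}}_{\boldsymbol{\Phi}}\big({\boldsymbol{G}}_{\boldsymbol{\Theta}}({\boldsymbol{I}}_{s})\big)\Big]
$$

With this reading, $\lambda$ and $\eta$ only rescale the adversarial and identity terms relative to the pixel-wise term, which carries an implicit weight of one.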

Since both ${\boldsymbol{G}}_{{\boldsymbol{\Theta}}}(\cdot)$ and ${\boldsymbol{D}}_{{\boldsymbol{\Phi}}}(\cdot)$ are differentiable functions, the error can be backpropagated w.r.t. ${\boldsymbol{\Theta}}$ and ${\boldsymbol{\Phi}}$ using Stochastic Gradient Descent (SGD) combined with Root Mean Square Propagation (RMSprop) Hinton, which helps our algorithm converge faster.
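For readers unfamiliar with RMSprop, a single update step can be sketched as below. This is the generic textbook rule, not the authors' training code: the averaging coefficient `rho` and the stabilizer `eps` are assumed values, and the toy quadratic is only there to show the update in motion.

```python
import numpy as np

def rmsprop_step(theta, grad, cache, lr=1e-3, rho=0.9, eps=1e-8):
    """One RMSprop update.

    cache is a running average of squared gradients; dividing by its square
    root rescales every coordinate to a comparable step size.
    rho and eps are assumed values, not taken from the paper.
    """
    cache = rho * cache + (1.0 - rho) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache

# Toy usage: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
cache = np.zeros_like(theta)
for _ in range(2000):
    theta, cache = rmsprop_step(theta, 2.0 * theta, cache, lr=1e-2)
```

In an adversarial setup, one such state (`cache`) would be kept separately for ${\boldsymbol{\Theta}}$ and for ${\boldsymbol{\Phi}}$, with the two updates applied in alternation.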

3.3 Implementation Details

The discriminative network $DN$ is only required in the training phase. In the testing phase, we take SP portraits as inputs and feed them to SRN. The outputs of SRN are the recovered photorealistic face images. We employ convolutional layers with kernels of size $4\times 4$ and stride $2$ in the encoder and deconvolutional layers with kernels of size $4\times 4$ and stride $2$ in the decoder. The feature maps in our encoder are passed to the decoder by skip connections. Batch normalization is applied after each convolutional and deconvolutional layer except for the last deconvolutional layer, similar to the models described in Goodfellow2014; radford2015unsupervised. For the non-linear activation function, we use the leaky rectified linear unit (leakyReLU maas2013rectifier), with the negative slope set to $0.2$.
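The effect of the $4\times 4$, stride-$2$ layers on spatial resolution can be checked with a short calculation. The sketch assumes a padding of 1 (so that each encoder layer halves the resolution and each decoder layer doubles it) and a hypothetical depth of four layers per side; neither value is stated in this section. The leaky ReLU with slope $0.2$ is included for completeness.

```python
import numpy as np

def conv_out(size, kernel=4, stride=2, pad=1):
    """Spatial output size of a strided convolution (pad=1 is assumed)."""
    return (size + 2 * pad - kernel) // stride + 1

def deconv_out(size, kernel=4, stride=2, pad=1):
    """Spatial output size of a transposed convolution (pad=1 is assumed)."""
    return (size - 1) * stride - 2 * pad + kernel

def leaky_relu(x, negative_slope=0.2):
    """Leaky ReLU with the negative slope used in the text."""
    return np.where(x > 0, x, negative_slope * x)

NUM_LAYERS = 4  # hypothetical depth, chosen only for illustration

encoder_sizes = [128]
for _ in range(NUM_LAYERS):
    encoder_sizes.append(conv_out(encoder_sizes[-1]))

decoder_sizes = [encoder_sizes[-1]]
for _ in range(NUM_LAYERS):
    decoder_sizes.append(deconv_out(decoder_sizes[-1]))
```

Because the decoder resolutions mirror the encoder resolutions under these assumptions, encoder feature maps can be passed to decoder layers of matching size through the skip connections.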

Our network is trained with a mini-batch size of 64, the learning rate set to $10^{-3}$ and the decay rate set to $10^{-2}$. In all our experiments, parameters $\lambda$ and $\eta$ are set to $10^{-2}$ and $10^{-3}$, respectively. As the iterations progress, the output faces become more similar to the ground truth. Hence, we gradually reduce the effect of the discriminative network by decreasing $\lambda$. Thus, $\lambda^{n}=\max\{\lambda\cdot 0.995^{n},\lambda/2\}$, where $n$ is the epoch index. The strategy in which we decrease $\lambda$ not only enriches the impact of the pixel-level similarity but also helps preserve the discriminative information in the SRN during training. We also decrease $\eta$ to reduce the impact of the identity-preserving constraint after each iteration. Thus, $\eta^{n}=\max\{\eta\cdot 0.995^{n},\eta/2\}$.
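The decay rule for the two trade-off weights is simple enough to tabulate directly. The sketch below implements $\max\{w\cdot 0.995^{n}, w/2\}$ for an initial weight $w$; the function name `decayed_weight` and the sample epochs are illustrative.

```python
def decayed_weight(initial, epoch, rate=0.995):
    """Geometric decay of a loss weight, floored at half its initial value."""
    return max(initial * rate ** epoch, initial / 2)

LAMBDA_0 = 1e-2  # initial weight of the discriminative term
ETA_0 = 1e-3     # initial weight of the identity-preserving term

# Weights at a few sample epochs (epochs chosen only for illustration).
schedule = {
    n: (decayed_weight(LAMBDA_0, n), decayed_weight(ETA_0, n))
    for n in (0, 50, 100, 139, 200)
}
```

Since $0.995^{n}$ drops below $1/2$ only once $n$ reaches 139, both weights decay smoothly for roughly the first 139 epochs and then stay fixed at half of their initial values.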

As our method is feed-forward (no optimization is required at test time), it takes 8 ms to destylize a $128\times 128$ image.

4 Synthesized Dataset and Preprocessing

To train our IFRP network and avoid overfitting, a large number of SF/RF image pairs are required. To generate a dataset of such pairs, similar to Shiri2017FaceD, we use the Celebrity dataset (CelebA) Liu2015faceattributes. First, we randomly select 110K faces from the CelebA dataset for training and 2K face images for testing. The original size of the images is $178\!\times\!218$ pixels. Subsequently, we crop the center of each image and resize it to $128\!\times\!128$ pixels. We use these cropped images as our RF ground truth face images ${\boldsymbol{I}}_{r}$. Lastly, we apply affine transformations to the aligned ground truth face images to generate in-plane unaligned face images.
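The crop-and-resize step and the in-plane misalignment can be sketched as follows. Several details are assumptions made for illustration only: the crop is taken as a centered $178\times 178$ square, resizing uses nearest-neighbor sampling, and the affine map is restricted to rotation, isotropic scaling and translation. The exact crop size, interpolation and transformation ranges are not given in this section.

```python
import numpy as np

def center_crop(img, size):
    """Cut a size x size window from the middle of an (H, W, C) image."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def resize_nearest(img, out_size):
    """Nearest-neighbor resize of an (H, W, C) image to out_size x out_size."""
    h, w = img.shape[:2]
    rows = np.arange(out_size) * h // out_size
    cols = np.arange(out_size) * w // out_size
    return img[rows][:, cols]

def in_plane_affine(angle_deg, scale, tx, ty):
    """2x3 matrix for an in-plane rotation, scaling and translation."""
    a = np.deg2rad(angle_deg)
    c, s = scale * np.cos(a), scale * np.sin(a)
    return np.array([[c, -s, tx], [s, c, ty]])

# A blank array standing in for one CelebA image (178 wide, 218 high).
image = np.zeros((218, 178, 3), dtype=np.uint8)
face = resize_nearest(center_crop(image, 178), 128)

# Hypothetical misalignment parameters, chosen only for illustration.
misalign = in_plane_affine(angle_deg=15.0, scale=0.9, tx=4.0, ty=-3.0)
```

The matrix `misalign` would then be passed to an image-warping routine to produce the unaligned input, while `face` is kept as the aligned ground truth.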

Moreover, to synthesize our training dataset, we retrain the real-time style transfer network johnson2016perceptual for different artworks. We use only three distinct styles (Scream, Candy and Mosaic) for synthesizing our training dataset. The procedure detailing how we selected these styles is explained in Section LABEL:metric. We also use 2K unaligned ground truth face images to synthesize 20K SF images from ten diverse styles (Scream, Wave, Candy, Feathers, Sketch, Composition VII, Starry Night, Udnie, Mosaic and La Muse) as our testing dataset. Note that we also include artistic sketches as an unseen style in our test dataset. Some stylized face images used for training and testing are shown in Figure 6. Lastly, we emphasize that there is no overlap between the training and testing datasets.

Bibliography

(1) Archibald Prize; Art Gallery of NSW. https://www.artgallery.nsw.gov.au/prizes/archibald/ (2017)
(2) Chen, D., Yuan, L., Liao, J., Yu, N., Hua, G.: Stylebank: An explicit representation for neural image style transfer. arXiv preprint arXiv:1703.09210 (2017)
(3) Chen, T.Q., Schmidt, M.: Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337 (2016)
(4) Denton, E.L., Chintala, S., Fergus, R., et al.: Deep generative image models using a laplacian pyramid of adversarial networks. In: NIPS (2015)
(5) Dumoulin, V., Shlens, J., Kudlur, M.: A learned representation for artistic style. arXiv preprint arXiv:1610.07629 (2016)
(6) Gatys, L.A., Bethge, M., Hertzmann, A., Shechtman, E.: Preserving color in neural artistic style transfer. arXiv preprint arXiv:1606.05897 (2016)
(7) Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: CVPR (2016)
(8) Gatys, L.A., Ecker, A.S., Bethge, M., Hertzmann, A., Shechtman, E.: Controlling perceptual factors in neural style transfer. arXiv preprint arXiv:1611.07865 (2016)