FSGAN: Subject Agnostic Face Swapping and Reenactment

Yuval Nirkin; Yosi Keller; Tal Hassner

arXiv:1908.05932·cs.CV·August 19, 2019

TL;DR

FSGAN is a subject-agnostic face swapping and reenactment system that uses separate generator networks for reenactment, inpainting, and blending to handle pose, expression, and occlusion, without requiring training on the specific faces involved.

Contribution

The paper introduces a subject-agnostic face swapping framework comprising a new RNN-based reenactment network, a face completion network, and a face blending network trained with a novel Poisson blending loss.

Findings

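1. Retains target pose and expression better than Nirkin et al. [35] and DeepFakes [12] on FaceForensics++, with comparable identity verification and SSIM scores.

2. Handles pose, expression, and occlusion variations without subject-specific training.

3. Blends the swapped face into the target image while preserving the target skin tone and lighting.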

Tables

Table 1: Quantitative swapping results on FaceForensics++ videos [39].

Method	Verification $\downarrow$	SSIM $\uparrow$	Euler (degrees) $\downarrow$	Landmarks (pixels) $\downarrow$
Nirkin et al. [35]	0.39 $\pm$ 0.00	0.49 $\pm$ 0.00	3.15 $\pm$ 0.04	26.5 $\pm$ 17.7
DeepFakes [12]	0.38 $\pm$ 0.00	0.50 $\pm$ 0.00	4.05 $\pm$ 0.04	34.1 $\pm$ 16.6
FSGAN	0.38 $\pm$ 0.00	0.51 $\pm$ 0.00	2.49 $\pm$ 0.04	22.2 $\pm$ 17.7

Table 2: Quantitative ablation results on FaceForensics++ videos [39].

Method	Verification $\downarrow$	SSIM $\uparrow$	Euler (degrees) $\downarrow$	Landmarks (pixels) $\downarrow$
FSGAN $(G_{r})$	0.38 $\pm$ 0.00	0.54 $\pm$ 0.00	3.16 $\pm$ 0.03	22.6 $\pm$ 16.5
FSGAN $(G_{r} + G_{c})$	0.38 $\pm$ 0.00	0.54 $\pm$ 0.00	3.21 $\pm$ 0.08	24.5 $\pm$ 17.2
FSGAN $(G_{r} + G_{b})$	0.38 $\pm$ 0.00	0.52 $\pm$ 0.00	2.75 $\pm$ 0.05	23.6 $\pm$ 17.9
FSGAN $(G_{r} + G_{c} + G_{b})$	0.38 $\pm$ 0.00	0.51 $\pm$ 0.00	2.49 $\pm$ 0.04	22.2 $\pm$ 17.7

Equations

$$\mathcal{L}_{perc}(x,y)=\sum_{i=1}^{n}\frac{1}{C_{i}H_{i}W_{i}}\left\|F_{i}(x)-F_{i}(y)\right\|_{1}. \tag{1}$$

$$\mathcal{L}_{pixel}(x,y)=\left\|x-y\right\|_{1}. \tag{2}$$

$$\mathcal{L}_{rec}(x,y)=\lambda_{perc}\mathcal{L}_{perc}(x,y)+\lambda_{pixel}\mathcal{L}_{pixel}(x,y). \tag{3}$$

$$\mathcal{L}_{adv}(G,D)=\min_{G}\max_{D_{1},\dots,D_{n}}\sum_{i=1}^{n}\mathcal{L}_{GAN}(G,D_{i}), \tag{4}$$

$$\mathcal{L}_{GAN}(G,D)=\mathbb{E}_{(x,y)}\left[\log D(x,y)\right]+\mathbb{E}_{x}\left[\log\left(1-D(x,G(x))\right)\right]. \tag{5}$$

$$I_{r_{j}},S_{r_{j}}=G_{r}(I_{r_{j-1}};H(p_{j})),\qquad I_{r_{0}}=I_{s}. \tag{6}$$

$$\mathcal{L}(G_{r})=\lambda_{stepwise}\mathcal{L}_{rec}(\widetilde{I}_{r_{n}},\widetilde{I}_{t})+\lambda_{rec}\mathcal{L}_{rec}(\widetilde{I}_{r},\widetilde{I}_{t})+\lambda_{adv}\mathcal{L}_{adv}+\lambda_{seg}\mathcal{L}_{pixel}(S_{r},S_{t}). \tag{7}$$

$$\mathcal{L}(G_{s})=\mathcal{L}_{ce}+\lambda_{reenactment}\mathcal{L}_{pixel}(S_{t},S_{r}^{t}), \tag{8}$$

$$I_{r}=\sum_{k=1}^{3}\lambda_{k}G_{r}(I_{s_{i_{k}}};H(p_{t})), \tag{9}$$

$$\mathcal{L}(G_{c})=\lambda_{rec}\mathcal{L}_{rec}(I_{c},\tilde{I}_{t})+\lambda_{adv}\mathcal{L}_{adv}, \tag{10}$$

$$P(I_{t};I_{r}^{t};S_{t})=\arg\min_{f}\left\|\nabla f-\nabla I_{r}^{t}\right\|_{2}^{2}\quad\text{s.t.}\quad f(i,j)=I_{t}(i,j),\ \forall\,S_{t}(i,j)=0, \tag{11}$$

$$\mathcal{L}(G_{b})=\lambda_{rec}\mathcal{L}_{rec}\left(G_{b}(I_{t};I_{r}^{t};S_{t}),P(I_{t};I_{r}^{t};S_{t})\right)+\lambda_{adv}\mathcal{L}_{adv}. \tag{12}$$

$$G_{r}=G_{c}=\mathrm{Enhancer}(\mathrm{Global}(2,2,3),2),$$

$$G_{b}=\mathrm{Enhancer}(\mathrm{Global}(1,1,1),1).$$

Code & Models

Repositories

YuvalNirkin/fsgan (PyTorch, official)

Taxonomy

Methods: Convolution

Full text

FSGAN: Subject Agnostic Face Swapping and Reenactment

Yuval Nirkin

Bar-Ilan University, Israel

Yosi Keller

Bar-Ilan University, Israel

Tal Hassner

The Open University of Israel, Israel

Abstract

We present Face Swapping GAN (FSGAN) for face swapping and reenactment. Unlike previous work, FSGAN is subject agnostic and can be applied to pairs of faces without requiring training on those faces. To this end, we describe a number of technical contributions. We derive a novel recurrent neural network (RNN)–based approach for face reenactment which adjusts for both pose and expression variations and can be applied to a single image or a video sequence. For video sequences, we introduce continuous interpolation of the face views based on reenactment, Delaunay Triangulation, and barycentric coordinates. Occluded face regions are handled by a face completion network. Finally, we use a face blending network for seamless blending of the two faces while preserving target skin color and lighting conditions. This network uses a novel Poisson blending loss which combines Poisson optimization with perceptual loss. We compare our approach to existing state-of-the-art systems and show our results to be both qualitatively and quantitatively superior.

1 Introduction

Face swapping is the task of transferring a face from a source to a target image, so that it seamlessly replaces a face appearing in the target and produces a realistic result (teaser figure, left). Face reenactment (aka face transfer or puppeteering) uses the facial movements and expression deformations of a control face in one video to guide the motions and deformations of a face appearing in a video or image (teaser figure, right). Both tasks are attracting significant research attention due to their applications in entertainment [1, 21, 48], privacy [6, 26, 32], and training data generation.

Previous work proposed either methods for swapping or for reenactment but rarely both. Earlier methods relied on underlying 3D face representations [46] to transfer or control facial appearances. Face shapes were either estimated from the input image [44, 42, 35] or were fixed [35]. The 3D shape was then aligned with the input images [10] and used as a proxy when transferring intensities (swapping) or controlling facial expression and viewpoints (reenactment).

Recently, deep network–based methods were proposed for face manipulation tasks. Generative adversarial networks (GANs) [13], for example, were shown to successfully generate realistic images of fake faces. Conditional GANs (cGANs) [31, 17, 47] were used to transform an image depicting real data from one domain to another and inspired multiple face reenactment schemes [37, 50, 40]. Finally, the DeepFakes project [12] leveraged cGANs for face swapping in videos, making swapping widely accessible to non-experts and receiving significant public attention. Those methods are capable of generating realistic face images by replacing the classic graphics pipeline. They all, however, still implicitly use 3D face representations.

Some methods relied on latent feature space domain separation [45, 34, 33]. These methods decompose the identity component of the face from the remaining traits, and encode identity as the manifestation of latent feature vectors, resulting in significant information loss and limiting the quality of the synthesized images. Subject specific methods [42, 12, 50, 22] must be trained for each subject or pair of subjects and so require expensive subject specific data—typically thousands of face images—to achieve reasonable results, limiting their potential usage. Finally, a major concern shared by previous face synthesis schemes, particularly the 3D based methods, is that they all require special care when handling partially occluded faces.

We propose a deep learning–based approach to face swapping and reenactment in images and videos. Unlike previous work, our approach is subject agnostic: it can be applied to faces of different subjects without requiring subject specific training. Our Face Swapping GAN (FSGAN) is end-to-end trainable and produces photo realistic, temporally coherent results. We make the following contributions:

• Subject agnostic swapping and reenactment. To the best of our knowledge, our method is the first to simultaneously manipulate pose, expression, and identity without requiring person-specific or pair-specific training, while producing high quality and temporally coherent results.

• Multiple view interpolation. We offer a novel scheme for interpolating between multiple views of the same face in a continuous manner based on reenactment, Delaunay Triangulation and barycentric coordinates.

• New loss functions. We propose two new losses: a stepwise consistency loss, for training face reenactment progressively in small steps, and a Poisson blending loss, to train the face blending network to seamlessly integrate the source face into its new context.

We test our method extensively, reporting qualitative and quantitative ablation results and comparisons with state of the art. The quality of our results surpasses existing work even without training on subject specific images.

2 Related work

Methods for manipulating the appearances of face images, particularly for face swapping and reenactment, have a long history, going back nearly two decades. These methods were originally proposed due to privacy concerns [6, 26, 32] though they are increasingly used for recreation [21] or entertainment (e.g., [1, 48]).

3D based methods. The earliest swapping methods required manual involvement [6]. An automatic method was proposed a few years later [4]. More recently, Face2Face transferred expressions from source to target face [44]. Transfer is performed by fitting a 3D morphable face model (3DMM) [5, 7, 11] to both faces and then applying the expression components of one face onto the other with care given to interior mouth regions. The reenactment method of Suwajanakorn et al. [42] synthesized the mouth part of the face using a reconstructed 3D model of (former president) Obama, guided by face landmarks, and using a similar strategy for filling the face interior as in Face2Face. The expression of frontal faces was manipulated by Averbuch-Elor et al. [3] by transferring the mouth interior from source to target image using 2D warps and face landmarks.

Finally, Nirkin et al. [35] proposed a face swapping method, showing that 3D face shape estimation is unnecessary for realistic face swaps. Instead, they used a fixed 3D face shape as the proxy [14, 29]. Like us, they proposed a face segmentation method, though their work was not end-to-end trainable and required special attention to occlusions. We show our results to be superior to theirs.

GAN-based methods. GANs [13] were shown to generate fake images with the same distribution as a target domain. Although successful in generating realistic appearances, GAN training can be unstable, which restricts their application to low-resolution images. Subsequent methods, however, improved the stability of the training process [28, 2]. Karras et al. [20] train GANs using a progressive multiscale scheme, from low to high image resolutions. CycleGAN [52] proposed a cycle consistency loss, allowing training of unsupervised generic transformations between different domains. A cGAN with $L_{1}$ loss was applied by Isola et al. [17] to derive the pix2pix method, and was shown to produce appealing synthesis results for applications such as transforming edges to faces.

Facial manipulation using GANs. Pix2pixHD [47] used GANs for high resolution image-to-image translation by applying a multi-scale cGAN architecture and adding a perceptual loss [18]. GANimation [37] proposed a dual generator cGAN conditioned on emotion action units, that generates an attention map. This map was used to interpolate between the reenacted and original images, to preserve the background. GANnotation [40] proposed deep facial reenactment driven by face landmarks. It generates images progressively using a triple consistency loss: it first frontalizes an image using landmarks then processes the frontal face.

Kim et al. [22] recently proposed a hybrid 3D/deep method. They render a reconstructed 3DMM of a specific subject using a classic graphic pipeline. The rendered image is then processed by a generator network, trained to map synthetic views of each subject to photo-realistic images.

Finally, feature disentanglement was proposed as a means for face manipulation. RSGAN [34] disentangles the latent representations of face and hair whereas FSNet [33] proposed a latent space which separates identity and geometric components, such as facial pose and expression.

3 Face swapping GAN

In this work we introduce the Face Swapping GAN (FSGAN), illustrated in Fig. 1. Let $I_{s}$ be the source and $I_{t}$ the target images of faces $F_{s}\in I_{s}$ and $F_{t}\in I_{t}$ , respectively. We aim to create a new image based on $I_{t}$ , where $F_{t}$ is replaced by $F_{s}$ while retaining the same pose and expression.

FSGAN consists of three main components. The first, detailed in Sec. 3.2 (Fig. 1(a)), consists of a reenactment generator $G_{r}$ and a segmentation CNN $G_{s}$ . $G_{r}$ is given heatmaps encoding the facial landmarks of $F_{t}$ , and generates the reenacted image $I_{r}$ , such that $F_{r}$ depicts $F_{s}$ at the same pose and expression as $F_{t}$ . It also computes $S_{r}$ : the segmentation mask of $F_{r}$ . Component $G_{s}$ computes the face and hair segmentations of $F_{t}$ .

The reenacted image, $I_{r}$ , may contain missing face parts, as illustrated in Fig. 1 and Fig. 1(b). We therefore apply the face inpainting network, $G_{c}$ , detailed in Sec. 3.4 using the segmentation $S_{t}$ , to estimate the missing pixels. The final part of the FSGAN, shown in Fig. 1(c) and Sec. 3.5, is the blending of the completed face $F_{c}$ into the target image $I_{t}$ to derive the final face swapping result.

The architecture of our face segmentation network, $G_{s}$ , is based on U-Net [38], with bilinear interpolation for upsampling. All our other generators— $G_{r}$ , $G_{c}$ , and $G_{b}$ —are based on those used by pix2pixHD [47], with coarse-to-fine generators and multi-scale discriminators. Unlike pix2pixHD, our global generator uses a U-Net architecture with bottleneck blocks [15] instead of simple convolutions and summation instead of concatenation. As with the segmentation network, we use bilinear interpolation for upsampling in both global generator and enhancers. The actual number of layers differs between generators.

As others have observed [50], training subject agnostic face reenactment is non-trivial and might fail when applied to unseen face images related by large poses. To address this challenge, we propose to break large pose changes into small manageable steps and interpolate between the closest available source images corresponding to a target’s pose. These steps are explained in the following sections.

3.1 Training losses

Domain specific perceptual loss. To capture fine facial details we adopt the perceptual loss [18], widely used in recent work for face synthesis [40], outdoor scenes [47], and super resolution [25]. Perceptual loss uses the feature maps of a pretrained VGG network, comparing high frequency details using a Euclidean distance.

We found it hard to fully capture details inherent to face images using a network pretrained on a generic dataset such as ImageNet. Instead, our perceptual networks are trained on the target domain: we train multiple VGG-19 networks [41] for face recognition and face attribute classification. Let $F_{i}\in\mathbb{R}^{C_{i}\times H_{i}\times W_{i}}$ be the feature map of the $i$ -th layer of our network; the perceptual loss is then given by

$$\mathcal{L}_{perc}(x,y)=\sum_{i=1}^{n}\frac{1}{C_{i}H_{i}W_{i}}\left\|F_{i}(x)-F_{i}(y)\right\|_{1}. \tag{1}$$

Reconstruction loss. While the perceptual loss of Eq. (1) captures fine details well, generators trained using only that loss often produce images with inaccurate colors, corresponding to poor reconstruction of low frequency image content. We hence also applied a pixelwise $L_{1}$ loss to the generators:

$$\mathcal{L}_{pixel}(x,y)=\left\|x-y\right\|_{1}. \tag{2}$$

The overall loss is then given by

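$$\mathcal{L}_{rec}(x,y)=\lambda_{perc}\mathcal{L}_{perc}(x,y)+\lambda_{pixel}\mathcal{L}_{pixel}(x,y). \tag{3}$$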

The loss in Eq. (3) was used with all our generators.
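The three losses above can be illustrated with a minimal pure-Python sketch. It assumes images and feature maps are flat lists of numbers; the function names and the `extract` callable (a stand-in for the VGG-19 feature layers) are hypothetical and not taken from the authors' code. The default weights follow the values reported in Sec. 4.2.

```python
from math import prod


def l1_norm(a, b):
    """Sum of absolute differences between two equally sized flat lists."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same number of elements")
    return sum(abs(u - v) for u, v in zip(a, b))


def perceptual_loss(feats_x, feats_y):
    """Per-layer L1 distance, each normalized by C_i * H_i * W_i.

    feats_x, feats_y: lists of (shape, values) pairs, one per layer,
    where shape = (C, H, W) and values is the flattened feature map.
    """
    total = 0.0
    for (shape, fx), (_, fy) in zip(feats_x, feats_y):
        total += l1_norm(fx, fy) / prod(shape)
    return total


def reconstruction_loss(x, y, extract, lam_perc=1.0, lam_pixel=0.1):
    """Weighted sum of the perceptual and pixelwise terms.

    extract is a hypothetical feature extractor: image -> list of (shape, values).
    """
    perc = perceptual_loss(extract(x), extract(y))
    return lam_perc * perc + lam_pixel * l1_norm(x, y)
```

In a real pipeline `extract` would return activations of a network trained on faces; here any callable with that return shape works, which keeps the sketch independent of a deep learning framework.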

Adversarial loss. To further improve the realism of our generated images we use an adversarial objective [47]. We utilized a multi-scale discriminator consisting of multiple discriminators, $D_{1},D_{2},...,D_{n}$ , each one operating on a different image resolution. For a generator $G$ and a multi-scale discriminator $D$ , our adversarial loss is defined by:

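$$\mathcal{L}_{adv}(G,D)=\min_{G}\max_{D_{1},\dots,D_{n}}\sum_{i=1}^{n}\mathcal{L}_{GAN}(G,D_{i}), \tag{4}$$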

where $\mathcal{L}_{GAN}(G,D)$ is defined as:

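$$\mathcal{L}_{GAN}(G,D)=\mathbb{E}_{(x,y)}\left[\log D(x,y)\right]+\mathbb{E}_{x}\left[\log\left(1-D(x,G(x))\right)\right]. \tag{5}$$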
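To make the multi-scale objective concrete, the sketch below evaluates the quantity that the discriminators try to maximize and the generator tries to minimize. It assumes each discriminator outputs a probability in (0, 1) for every sample; the function names and the small `eps` guard are illustrative choices, not part of the paper.

```python
import math


def gan_value(d_real, d_fake, eps=1e-12):
    """Mean log D(real) plus mean log(1 - D(fake)) for one discriminator."""
    real_term = sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def multiscale_value(per_scale_outputs):
    """Sum the single-scale value over all discriminators.

    per_scale_outputs: list of (d_real, d_fake) pairs, one per image scale.
    """
    return sum(gan_value(d_real, d_fake) for d_real, d_fake in per_scale_outputs)
```

A discriminator that separates real from generated samples well yields a value close to zero, while one that outputs 0.5 everywhere yields 2·log(0.5) per scale.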

3.2 Face reenactment and segmentation

Given an image $I\in\mathbb{R}^{3\times H\times W}$ and a heatmap representation $H(p)\in\mathbb{R}^{70\times H\times W}$ of facial landmarks, $p\in\mathbb{R}^{70\times 2}$ , we define the face reenactment generator, $G_{r}$ , as the mapping $G_{r}:\left\{\mathbb{R}^{3\times H\times W},\mathbb{R}^{70\times H\times W}\right\}\rightarrow\mathbb{R}^{3\times H\times W}$ .
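As an illustration of the heatmap representation $H(p)$, the sketch below renders one channel per landmark. The Gaussian shape and the `sigma` value are assumptions made for this example; the text above does not specify how the heatmaps are drawn.

```python
import math


def landmark_heatmaps(points, height, width, sigma=2.0):
    """Return one height x width map per landmark, peaking at the landmark.

    points: list of (x, y) pixel coordinates.
    The Gaussian profile and sigma are illustrative assumptions.
    """
    maps = []
    for px, py in points:
        channel = [
            [
                math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2.0 * sigma ** 2))
                for x in range(width)
            ]
            for y in range(height)
        ]
        maps.append(channel)
    return maps
```

With 70 landmarks this produces the 70-channel input described above, indexed as `maps[landmark][row][column]`.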

Let $v_{s},v_{t}\in\mathbb{R}^{70\times 3}$ and $e_{s},e_{t}\in\mathbb{R}^{3}$ , be the 3D landmarks and Euler angles corresponding to $F_{s}$ and $F_{t}$ . We generate intermediate 2D landmark positions $p_{j}$ by interpolating between $e_{s}$ and $e_{t}$ , and the centroids of $v_{s}$ and $v_{t}$ , using intermediate points for which we project $v_{s}$ back to $I_{s}$ . We define the reenactment output recursively for each iteration $1\leq j\leq n$ as

$$I_{r_{j}},S_{r_{j}}=G_{r}(I_{r_{j-1}};H(p_{j})),\qquad I_{r_{0}}=I_{s}. \tag{6}$$
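The recursion can be sketched as a simple loop. This example assumes the intermediate poses are spaced linearly between the source and target Euler angles, which is one plausible reading of "interpolating"; `generator` and `heatmaps_for_pose` are hypothetical callables standing in for $G_{r}$ and for the projection of the posed landmarks into heatmaps.

```python
def interpolate_poses(e_s, e_t, n):
    """Return n Euler-angle triples moving linearly from e_s to e_t.

    The source pose itself is excluded and the target pose is the last entry.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return [
        tuple(a + (b - a) * j / n for a, b in zip(e_s, e_t))
        for j in range(1, n + 1)
    ]


def reenact(source_image, e_s, e_t, n, generator, heatmaps_for_pose):
    """Apply the generator n times, feeding each output back as the next input."""
    image, mask = source_image, None
    for pose in interpolate_poses(e_s, e_t, n):
        image, mask = generator(image, heatmaps_for_pose(pose))
    return image, mask
```

Each step only has to cover a fraction of the total pose change, which is the motivation given above for breaking large pose differences into small ones.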

Similar to others [37], the last layer of the global generator and each of the enhancers in $G_{r}$ is split into two heads: the first produces the reenacted image and the second the segmentation mask. In contrast to binary masks used by others [37], we consider the face and hair regions separately. The binary mask implicitly learned by the reenactment network captures most of the head including the hair, which we segment separately. Moreover, the additional hair segmentation also improves the accuracy of the face segmentation. The face segmentation generator $G_{s}$ is defined as $G_{s}:\mathbb{R}^{3\times H\times W}\rightarrow\mathbb{R}^{3\times H\times W}$ , where given an RGB image it outputs a 3-channel segmentation mask encoding the background, face, and hair.

Training. Inspired by the triple consistency loss [40], we propose a stepwise consistency loss. Given an image pair $(I_{s},I_{t})$ of the same subject from a video sequence, let $I_{r_{n}}$ be the reenactment result after $n$ iterations, and $\widetilde{I}_{t},\widetilde{I}_{r_{n}}$ be the same images with their background removed using the segmentation masks $S_{t}$ and $S_{r_{j}}$ , respectively. The stepwise consistency loss is defined as: $\mathcal{L}_{rec}(\widetilde{I}_{r_{n}},\widetilde{I}_{t})$ . The final objective for $G_{r}$ is:

$$\mathcal{L}(G_{r})=\lambda_{stepwise}\mathcal{L}_{rec}(\widetilde{I}_{r_{n}},\widetilde{I}_{t})+\lambda_{rec}\mathcal{L}_{rec}(\widetilde{I}_{r},\widetilde{I}_{t})+\lambda_{adv}\mathcal{L}_{adv}+\lambda_{seg}\mathcal{L}_{pixel}(S_{r},S_{t}). \tag{7}$$

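For the objective of $G_{s}$ we use the standard cross-entropy loss, $\mathcal{L}_{ce}$ , with additional guidance from $G_{r}$ :

$$\mathcal{L}(G_{s})=\mathcal{L}_{ce}+\lambda_{reenactment}\mathcal{L}_{pixel}(S_{t},S_{r}^{t}), \tag{8}$$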

where $S_{r}^{t}$ is the segmentation mask result of $G_{r}(I_{t};H(p_{t}))$ and $p_{t}$ are the 2D landmarks corresponding to $I_{t}$ .

We train both $G_{r}$ and $G_{s}$ together, in an interleaved fashion. We start by training $G_{s}$ for one epoch, followed by training $G_{r}$ for an additional epoch, increasing $\lambda_{reenactment}$ as the training progresses. We have found that training $G_{r}$ and $G_{s}$ together helps filter noise learned from coarse face and hair segmentation labels.

3.3 Face view interpolation

Standard computer graphics pipelines project textured mesh polygons onto a plane for seamless rendering [16]. We propose a novel, alternative scheme for continuous interpolation between face views. This step is an essential phase of our method, as it allows using the entire source video sequence, without training our model on a particular video frame, making it subject agnostic.

Given a set of source subject images, $\left\{\mathbf{I}_{s_{1}},\dots,\mathbf{I}_{s_{n}}\right\}$ , and Euler angles, $\left\{\mathbf{e}_{1},\dots,\mathbf{e}_{n}\right\}$ , of the corresponding faces $\left\{\mathbf{F}_{s_{1}},\dots,\mathbf{F}_{s_{n}}\right\}$ , we construct the appearance map of the source subject, illustrated in Fig. 2(a). This appearance map embeds head poses in a triangulated plane, allowing head poses to follow continuous paths.

We start by projecting the Euler angles $\left\{\mathbf{e}_{1},\dots,\mathbf{e}_{n}\right\}$ onto a plane by dropping the roll angle. Using a k-d tree data structure [16], we remove points in the angular domain that are too close to each other, prioritizing the points for which the corresponding Euler angles have a roll angle closer to zero. We further remove motion blurred images. Using the remaining points, $\left\{x_{1},\dots,x_{m}\right\}$ , and the four boundary points, $y_{i}\in[-75,75]\times[-75,75]$ , we build a mesh, $M$ , in the angular domain by Delaunay Triangulation.
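The pruning step can be sketched as a greedy filter. For brevity this example uses a quadratic scan over the kept points in place of the k-d tree named above, and it assumes the angle order (yaw, pitch, roll) and a caller-supplied `min_dist` threshold, neither of which is specified in the text.

```python
import math


def prune_views(euler_angles, min_dist):
    """Return indices of views to keep, preferring views with roll near zero.

    euler_angles: list of (yaw, pitch, roll) triples in degrees (assumed order).
    A view is dropped when its (yaw, pitch) projection lies closer than
    min_dist to a view that was already kept.
    """
    # Visit views in order of increasing |roll| so near-upright views win ties.
    order = sorted(range(len(euler_angles)), key=lambda i: abs(euler_angles[i][2]))
    kept = []
    for i in order:
        yaw, pitch, _ = euler_angles[i]
        far_enough = all(
            math.hypot(yaw - euler_angles[k][0], pitch - euler_angles[k][1]) >= min_dist
            for k in kept
        )
        if far_enough:
            kept.append(i)
    return sorted(kept)
```

The surviving points, together with the four boundary points, would then be handed to a Delaunay triangulation routine to build the mesh.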

For a query Euler angle, $e_{t}$ , of a face, $F_{t}$ , and its corresponding projected point, $x_{t}$ , we find the triangle $T\in M$ that contains $x_{t}$ . Let $x_{i_{1}},x_{i_{2}},x_{i_{3}}$ be the vertices of $T$ and $I_{s_{i_{1}}},I_{s_{i_{2}}},I_{s_{i_{3}}}$ be the corresponding face views. We calculate the barycentric coordinates, $\lambda_{1},\lambda_{2},\lambda_{3}$ of $x_{t}$ , with respect to $x_{i_{1}},x_{i_{2}},x_{i_{3}}$ . The interpolation result $I_{r}$ is then

$$I_{r}=\sum_{k=1}^{3}\lambda_{k}G_{r}(I_{s_{i_{k}}};H(p_{t})), \tag{9}$$

where $p_{t}$ are the 2D landmarks of $F_{t}$ . If any vertices of the triangle are boundary points, we exclude them from the interpolation and normalize the weights, $\lambda_{i}$ , to sum to one.
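The weighting scheme can be sketched as follows, assuming 2D points in the (yaw, pitch) plane and images stored as flat lists. The function names and the `is_boundary` flags are hypothetical; the barycentric formula itself is the standard one for a triangle.

```python
def barycentric(p, a, b, c):
    """Barycentric coordinates of point p with respect to triangle (a, b, c)."""
    (x, y), (x1, y1), (x2, y2), (x3, y3) = p, a, b, c
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    if det == 0:
        raise ValueError("degenerate triangle")
    l1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    l2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    return l1, l2, 1.0 - l1 - l2


def interpolation_weights(p, triangle, is_boundary):
    """Zero the weights of boundary vertices and renormalize to sum to one."""
    weights = barycentric(p, *triangle)
    kept = [0.0 if boundary else w for w, boundary in zip(weights, is_boundary)]
    total = sum(kept)
    if total == 0:
        raise ValueError("no usable source view in this triangle")
    return [w / total for w in kept]


def blend_views(views, weights):
    """Pixel-wise weighted sum of equally sized flat images."""
    return [
        sum(w * view[i] for view, w in zip(views, weights))
        for i in range(len(views[0]))
    ]
```

In the full method the blended inputs would be the three reenacted views produced by $G_{r}$ for the target landmarks, not the raw source frames.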

A face view query is illustrated in Fig. 2(b,c). To improve interpolation accuracy, we use a horizontal flip to fill in views when the appearance map is one-sided with respect to the yaw dimension, and generate artificial views using $G_{r}$ when the appearance map is too sparse.

3.4 Face inpainting

Occluded regions in the source face $F_{s}$ cannot be rendered on the target face, $F_{t}$ . Nirkin et al. [35] used the segmentations of $F_{s}$ and $F_{t}$ to remove occluded regions, rendering (swapping) only regions visible in both source and target faces. Large occlusions and different facial textures can cause noticeable artifacts in the resulting images.

To mitigate such problems, we apply a face inpainting generator, $G_{c}$ (Fig. 1(b)). $G_{c}$ renders the face image $F_{s}$ such that the resulting face rendering $\tilde{I}_{r}$ covers the entire segmentation mask $S_{t}$ (of $F_{t}$ ), thereby resolving such occlusions.

Given the reenactment result, $I_{r}$ , its corresponding segmentation, $S_{r}$ , and the target image with its background removed, $\tilde{I}_{t}$ , all drawn from the same identity, we first augment $S_{r}$ by simulating common face occlusions due to hair, by randomly removing ellipse-shaped parts, in various sizes and aspect ratios from the border of $S_{r}$ . Let $\tilde{I}_{r}$ be $I_{r}$ with its background removed using the augmented version of $S_{r}$ , and $I_{c}$ the completed result from applying $G_{c}$ on $\tilde{I}_{r}$ . We define our inpainting generator loss as

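$$\mathcal{L}(G_{c})=\lambda_{rec}\mathcal{L}_{rec}(I_{c},\tilde{I}_{t})+\lambda_{adv}\mathcal{L}_{adv}, \tag{10}$$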

where $\mathcal{L}_{rec}$ and $\mathcal{L}_{adv}$ are the reconstruction and adversarial losses of Sec. 3.1.
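The occlusion augmentation applied to $S_{r}$ can be sketched on a binary mask stored as a list of rows. The axis-aligned ellipses, the radius range, and the function names are assumptions for illustration; the text only states that ellipse-shaped parts of various sizes and aspect ratios are removed from the mask border.

```python
import random


def border_pixels(mask):
    """Return (row, col) of mask pixels that touch the background or the image edge."""
    h, w = len(mask), len(mask[0])
    border = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                outside = not (0 <= ny < h and 0 <= nx < w)
                if outside or not mask[ny][nx]:
                    border.append((y, x))
                    break
    return border


def remove_random_ellipses(mask, count, max_radius, rng):
    """Return a copy of mask with `count` ellipses, centered on border pixels, cleared."""
    h, w = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    border = border_pixels(mask)
    if not border:
        return out
    for _ in range(count):
        cy, cx = rng.choice(border)
        ry = rng.uniform(1.0, max_radius)  # vertical radius
        rx = rng.uniform(1.0, max_radius)  # horizontal radius
        for y in range(h):
            for x in range(w):
                if ((y - cy) / ry) ** 2 + ((x - cx) / rx) ** 2 <= 1.0:
                    out[y][x] = 0
    return out
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible across runs.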

3.5 Face blending

The last step of the proposed face swapping scheme is blending of the completed face $F_{c}$ with its target face $F_{t}$ (Fig. 1(c)). Any blending must account for, among others, different skin tones and lighting conditions. Inspired by previous uses of Poisson blending for inpainting [51] and blending [49], we propose a novel Poisson blending loss.

Let $I_{t}$ be the target image, $I_{r}^{t}$ the image of the reenacted face transferred onto the target image, and $S_{t}$ the segmentation mask marking the transferred pixels. Following [36], we define the Poisson blending optimization as

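$$P(I_{t};I_{r}^{t};S_{t})=\arg\min_{f}\left\|\nabla f-\nabla I_{r}^{t}\right\|_{2}^{2}\quad\text{s.t.}\quad f(i,j)=I_{t}(i,j),\ \forall\,S_{t}(i,j)=0, \tag{11}$$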

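where $\nabla\left(\cdot\right)$ is the gradient operator. We combine the Poisson optimization in Eq. (11) with the perceptual loss. The Poisson blending loss is then

$$\mathcal{L}(G_{b})=\lambda_{rec}\mathcal{L}_{rec}\left(G_{b}(I_{t};I_{r}^{t};S_{t}),P(I_{t};I_{r}^{t};S_{t})\right)+\lambda_{adv}\mathcal{L}_{adv}. \tag{12}$$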
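The Poisson optimization that provides the blending target can be illustrated in one dimension: inside the mask the output should reproduce the gradients of the source signal, and outside the mask it must equal the target. The sketch below solves the resulting discrete Poisson equation with Gauss-Seidel sweeps; the 1D setting, the solver, and the iteration count are simplifications chosen for this example and are not taken from the paper.

```python
def poisson_blend_1d(target, source, mask, iterations=2000):
    """Blend source into target over the masked samples of a 1D signal.

    Masked samples satisfy f[i-1] - 2 f[i] + f[i+1] = Laplacian of source at i;
    unmasked samples keep the target value. The first and last samples must be
    unmasked so that they act as boundary conditions.
    """
    n = len(target)
    if not (len(source) == n and len(mask) == n):
        raise ValueError("target, source and mask must have equal length")
    if mask[0] or mask[-1]:
        raise ValueError("the first and last samples must be unmasked")
    # Initialize with the source inside the mask and the target outside.
    f = [float(s) if m else float(t) for t, s, m in zip(target, source, mask)]
    for _ in range(iterations):
        for i in range(1, n - 1):
            if mask[i]:
                lap = source[i - 1] - 2.0 * source[i] + source[i + 1]
                f[i] = 0.5 * (f[i - 1] + f[i + 1] - lap)
    return f
```

For a constant target of 10 and a source bump `[0, 1, 2, 3, 2, 1, 0]` masked on its interior, the result keeps the shape of the bump but shifts it to the target level, which is the tone-matching effect that the blending generator is trained to imitate.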

4 Datasets and training

4.1 Datasets and processing

We use the video sequences of the IJB-C dataset [30] to train our generator, $G_{r}$ , for which we automatically extracted the frames depicting particular subjects. IJB-C contains $\sim$ 11k face videos, of which we used 5,500 which were in high definition. Similar to the frame pruning approach of Sec. 3.3, we prune the face views that are too close together as well as motion-blurred frames.

We apply the segmentation CNN, $G_{s}$ , to the frames, and prune the frames for which less than 15% of the pixels in the face bounding box were classified as face pixels. We used dlib’s face verification (available at http://dlib.net/) to group frames according to the subject identity, and limit the number of frames per subject to 100, by choosing frames with the maximal variance in 2D landmarks. In each training iteration, we choose the frames $I_{s}$ and $I_{t}$ from two randomly chosen subjects.

We trained VGG-19 CNNs for the perceptual loss on the VGGFace2 dataset [9] for face recognition and the CelebA [27] dataset for face attribute classification. The VGGFace2 dataset contains 3.3M images depicting 9,131 identities, whereas CelebA contains 202,599 images, annotated with 40 binary attributes.

We trained the segmentation CNN, $G_{s}$ , on data used by others [35], consisting of ${\sim}10k$ face images labeled with face segmentations. We also used the LFW Parts Labels set [19] with ${\sim}3k$ images labeled for face and hair segmentations, removing the neck regions using facial landmarks.

We used additional 1k images and corresponding hair segmentations from the Figaro dataset [43]. Finally, FaceForensics++ [39] provides 1000 videos, from which they generated 1000 synthetic videos on random pairs using DeepFakes [12] and Face2Face [44].

4.2 Training details

We train the proposed generators from scratch, where the weights were initialized randomly using a normal distribution. We use Adam optimization [24] ( $\beta_{1}=0.5,\beta_{2}=0.999$ ) and a learning rate of $0.0002$ . We reduce this rate by half every ten epochs. The following parameters were used for all the generators: $\lambda_{perc}=1,\lambda_{pixel}=0.1,\lambda_{adv}=0.001,\lambda_{seg}=0.1,\lambda_{rec}=1,\lambda_{stepwise}=1$ , and $\lambda_{reenactment}$ is linearly increased from 0 to 1 during training. All of our networks were trained on eight NVIDIA Tesla V100 GPUs and an Intel Xeon CPU. Training of $G_{s}$ required six hours to converge, while the rest of the networks converged in two days. All our networks, except for $G_{s}$ , were trained using a progressive multi-scale approach, starting with a resolution of 128 $\times$ 128 and ending at 256 $\times$ 256. Inference rate is ${\sim}30$ fps for reenactment and ${\sim}10$ fps for swapping on one NVIDIA Tesla V100 GPU.
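The two schedules described above can be written as small functions. This sketch assumes epochs are counted from zero and that the linear ramp of $\lambda_{reenactment}$ spans the whole run; both are assumptions, since the text does not give the indexing or the ramp length.

```python
def learning_rate(epoch, base_lr=0.0002, step=10, factor=0.5):
    """Step decay: multiply the base rate by `factor` once every `step` epochs."""
    return base_lr * factor ** (epoch // step)


def reenactment_weight(epoch, total_epochs):
    """Linear ramp from 0 at the first epoch to 1 at the last epoch."""
    if total_epochs <= 1:
        return 1.0
    return min(1.0, max(0.0, epoch / (total_epochs - 1)))
```

In a framework such as PyTorch the first function would normally be replaced by a built-in step scheduler attached to the Adam optimizer.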

5 Experimental results

We performed extensive qualitative and quantitative experiments to verify the proposed scheme. We compare our method to two previous face swapping methods: DeepFakes [12] and Nirkin et al. [35], and the Face2Face reenactment scheme [44]. We conduct all our experiments on videos from FaceForensics++ [39], by running our method on the same pairs they used. We further report ablation studies showing the importance of each component in our pipeline.

5.1 Qualitative face reenactment results

Fig. 3 shows our raw face reenactment results, without background removal. We chose examples of varying ethnicity, pose, and expression. A particularly interesting example can be seen in the rightmost column, showing our method’s ability to cope with extreme expressions. To show the importance of iterative reenactment, Fig. 4 provides reenactments of the same subject for both small and large angle differences. As evident from the last column, for large angle differences, the identity and texture are better preserved using multiple iterations.

5.2 Qualitative face swapping results

Fig. 5 offers face swapping examples taken from FaceForensics++ videos, without training our model on these videos. We chose examples that represent different poses and expression, face shapes, and hair occlusions. Because Nirkin et al. [35] is an image-to-image face swapping method, to be fair in our comparison, for each frame in the target video we select the source frame with the most similar pose. To compare FSGAN in a video-to-video scenario, we use our face view interpolation described in Sec. 3.3.

5.3 Comparison to Face2Face

We compare our method to Face2Face [44] on the expression only reenactment problem. Given a pair of faces $F_{s}\in I_{s}$ and $F_{t}\in I_{t}$ the goal is to transfer the expression from $I_{s}$ to $I_{t}$ . To this end, we modify the corresponding 2D landmarks of $F_{t}$ by swapping in the mouth points of the 2D landmarks of $F_{s}$ , similarly to how we generate the intermediate landmarks in Sec. 3.2. The reenactment result is then given by $G_{r}(I_{t};H(\hat{p}_{t}))$ , where $\hat{p}_{t}$ are the modified landmarks. The examples are shown in Fig. 6.

5.4 Quantitative results

We report quantitative results, conforming to how we defined the face swapping problem: we validate how well methods preserve the source subject identity, while retaining the same pose and expression of the target subject. To this end, we first compare the face swapping result, $F_{b}$ , of each frame to its nearest neighbor in pose from the subject face views. We use the dlib [23] face verification method to compare identities and the structural similarity index method (SSIM) to compare their quality. To measure pose accuracy, we calculate the Euclidean distance between the Euler angles of $F_{b}$ and those of the original target image, $I_{t}$ . Similarly, the accuracy of the expression is measured as the Euclidean distance between the 2D landmarks. Pose error is measured in degrees and the expression error is measured in pixels. We compute the mean and variance of those measurements on the first 100 frames of the first 500 videos in FaceForensics++, averaging them across the videos. As baselines, we use Nirkin et al. [35] and DeepFakes [12].
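The pose and expression errors can be sketched as below. The pose error follows the description directly; for the landmarks, averaging the per-point distances is an assumption made here, since the text does not say how the distances over all landmarks are aggregated.

```python
import math


def pose_error(euler_a, euler_b):
    """Euclidean distance between two Euler-angle triples, in degrees."""
    return math.dist(euler_a, euler_b)


def landmark_error(landmarks_a, landmarks_b):
    """Mean per-point Euclidean distance between two 2D landmark sets, in pixels.

    Averaging over points is an illustrative choice.
    """
    if len(landmarks_a) != len(landmarks_b):
        raise ValueError("landmark sets must have the same number of points")
    distances = [math.dist(p, q) for p, q in zip(landmarks_a, landmarks_b)]
    return sum(distances) / len(distances)
```

Per-frame values computed this way would then be summarized by their mean and variance over the evaluated frames of each video.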

As is evident from the first two columns of Table 1, our approach preserves identity and image quality similarly to previous methods. The two rightmost metrics in Table 1 show that our method retains pose and expression much better than its baselines. Note that the human eye is very sensitive to artifacts on faces. This should be reflected in the quality score, but those artifacts usually occupy only a small part of the image and so the SSIM score does not reflect them well.

5.5 Ablation study

We performed ablation tests with four configurations of our method: $G_{r}$ only, $G_{r}+G_{c}$ , $G_{r}+G_{b}$ , and our full pipeline. The segmentation network, $G_{s}$ , is used in all configurations. Qualitative results are provided in Fig. 7.

Quantitative ablation results are reported in Table 2. Verification scores show that source identities are preserved across all pipeline networks. From Euler and landmarks scores we see that target poses and expressions are best retained with the full pipeline. Error differences are not extreme, suggesting that the inpainting and blending generators, $G_{c}$ and $G_{b}$ , respectively, preserve pose and expression similarly well. There is a slight drop in the SSIM, due to the additional networks and processing added to the pipeline.

6 Conclusion

Limitations. Fig. 4 shows our reenactment results for different facial yaw angles. Evidently, the larger the angular differences, the more identity and texture quality degrade. Moreover, too many iterations of the face reenactment generator blur the texture. Unlike 3DMM based methods, e.g., Face2Face [44], which warp textures directly from the image, our method is limited to the resolution of the training data. Another limitation arises from using a sparse landmark tracking method that does not fully capture the complexity of facial expressions.

Discussion. Our method eliminates laborious, subject-specific data collection and model training, making face swapping and reenactment accessible to non-experts. We feel strongly that it is of paramount importance to publish such technologies, in order to drive the development of technical counter-measures for detecting such forgeries, as well as to compel lawmakers to set clear policies for addressing their implications. Suppressing the publication of such methods would not stop their development, but rather make them available to a select few and potentially blindside policy makers if they are misused.

Appendix A Additional qualitative results

We offer additional qualitative face swapping results in Fig. 8. We have specifically chosen examples of challenging pairs, with partial occlusions, different ethnicities and skin colors, demonstrating the competence of our method on a large variety of subjects. In Fig. 9, we show an additional qualitative comparison to Nirkin et al. [35] and DeepFakes [12], and in Fig. 10 we show another comparison to Face2Face [44]. Please also see the attached video for more results.

Appendix B The architecture of the generator CNNs

The architecture of the generators, $G_{r}$, $G_{c}$, and $G_{b}$, is based on the pix2pixHD approach [47]; the layout of the global generator and enhancer is depicted in Fig. 11. The global generator is defined by the number of bottleneck blocks (shown in purple) used in each resolution scale. In our experiments we used only three resolutions. The enhancer is defined by its submodule, which is either the global generator or another enhancer, and by its number of bottleneck layers. The generators are thus given by

[TABLE]

and

[TABLE]

The face segmentation network $G_{s}$ is based on the U-Net approach [38], for which we replaced the deconvolution layers with bilinear interpolation upsampling layers.

Bibliography


[1] Oleg Alexander, Mike Rogers, William Lambeth, Matt Chiang, and Paul Debevec. Creating a photoreal digital actor: The digital emily project. In Conf. Visual Media Production, pages 176–187. IEEE, 2009.
[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
[3] Hadar Averbuch-Elor, Daniel Cohen-Or, Johannes Kopf, and Michael F Cohen. Bringing portraits to life. ACM Transactions on Graphics (TOG), 36(6):196, 2017.
[4] Dmitri Bitouk, Neeraj Kumar, Samreen Dhillon, Peter Belhumeur, and Shree K Nayar. Face swapping: automatically replacing faces in photographs. ACM Trans. on Graphics, 27(3):39, 2008.
[5] Volker Blanz, Sami Romdhani, and Thomas Vetter. Face identification across different poses and illuminations with a 3d morphable model. In Int. Conf. on Automatic Face and Gesture Recognition, pages 192–197, 2002.
[6] Volker Blanz, Kristina Scherbaum, Thomas Vetter, and Hans-Peter Seidel. Exchanging faces in images. Comput. Graphics Forum, 23(3):669–676, 2004.
[7] Volker Blanz and Thomas Vetter. Face recognition based on fitting a 3d morphable model. Trans. Pattern Anal. Mach. Intell., 25(9):1063–1074, 2003.
[8] Xavier P Burgos-Artizzu, Pietro Perona, and Piotr Dollár. Robust face landmark estimation under occlusion. In Proc. Int. Conf. Comput. Vision, pages 1513–1520. IEEE, 2013.