Dual-reference Age Synthesis

Yuan Zhou; Bingzhang Hu; Jun He; Yu Guan; Ling Shao

arXiv:1908.02671·cs.CV·June 19, 2020

TL;DR

This paper introduces a dual-reference age synthesis framework that generates age-progressed or regressed images by using two input images, one for identity and one for age, enabling more flexible and accurate age synthesis.

Contribution

The paper proposes a novel dual-reference framework for age synthesis that uses two images instead of a fixed age number, improving flexibility and control in age transformation.

Findings

01

Effective on UTKFace and CACD datasets

02

Outperforms traditional single-reference methods

03

Demonstrates high flexibility and realistic results

Abstract

Age synthesis methods typically take a single image as input and use a specific number to control the age of the generated image. In this paper, we propose a novel framework taking two images as inputs, named dual-reference age synthesis (DRAS), which approaches the task differently; instead of using "hard" age information, i.e. a fixed number, our model determines the target age in a "soft" way, by employing a second reference image. Specifically, the proposed framework consists of an identity agent, an age agent and a generative adversarial network. It takes two images as input - an identity reference and an age reference - and outputs a new image that shares corresponding features with each. Experimental results on two benchmark datasets (UTKFace and CACD) demonstrate the appealing performance and flexibility of the proposed framework.

Tables

Table 1: Description of 11 subsets for identity feature learning performance analysis.

Identity	Age Range
id0	14-23
id1	15-23
id2	15-24
id3	17-26
id4	20-29
id5	21-30
id6	26-33
id7	32-41
id8	43-52
id9	46-55
id10	49-58

Table 3: Comparative results of identity preservation with four models. Models $M1$ and $M3$ generally obtain higher verification confidence.

Model	Average Verification Confidence
M1	80.37 $\pm$ 3.13
M2	78.98 $\pm$ 0.63
M3	82.10 $\pm$ 2.83
M4	79.19 $\pm$ 1.71

Table 4: Effect of age preservation function. Bold numbers are the maximums.

Age Groups	Accuracy of $M1$ (%)	Accuracy of $M2$ (%)	Accuracy of $M3$ (%)	Accuracy of $M4$ (%)
0# (0-5)	99.93	99.8	99.86	99.93
1# (6-10)	99.63	92.85	93.32	99.49
2# (11-15)	92.47	74.44	74.88	93.86
3# (16-20)	93.12	78.98	80.41	92.17
4# (21-30)	100.00	99.97	99.9	100.00
5# (31-40)	93.90	89.97	90.54	93.83
6# (41-50)	84.71	71.59	74.1	85.12
7# (51-60)	90.10	92.51	97.63	97.02
8# (61-70)	97.29	98.20	99.53	99.66
9# (70+)	99.97	98.58	99.56	99.90
Average Acc. (%)	96.04	89.36	90.33	96.15

Table 5: Comparative results of identity preservation of the three methods. Synthesized images of IPCGAN look almost the same as their identity reference images, which is the reason for the highest average verification confidence of IPCGAN. However, it is not an ideal method because it lacks an aging effect.

Method	Average Verification Confidence
CAAE	71.43 $\pm$ 2.00
IPCGAN	94.13 $\pm$ 0.64
DRAS	81.52 $\pm$ 0.83

Table 6: Identity consistency of CAAE.

Age Groups	grp.1	grp.2	grp.3	grp.4	grp.5
grp.1	-	73.53 $\pm$ 7.71	71.10 $\pm$ 10.08	66.75 $\pm$ 10.40	68.696 $\pm$ 6.36
grp.2	-	-	85.30 $\pm$ 2.51	76.65 $\pm$ 4.19	72.47 $\pm$ 4.20
grp.3	-	-	-	79.08 $\pm$ 8.24	78.69 $\pm$ 1.80
grp.4	-	-	-	-	77.036 $\pm$ 2.27
Average Confidence	75.49 $\pm$ 6.91	81.07 $\pm$ 3.72	82.31 $\pm$ 4.53	79.38 $\pm$ 5.02	78.86 $\pm$ 2.93

Table 7: Identity consistency of IPCGAN.

Age Groups	grp.1	grp.2	grp.3	grp.4	grp.5
grp.1	-	95.86 $\pm$ 0.27	94.91 $\pm$ 0.16	94.65 $\pm$ 0.30	94.72 $\pm$ 0.43
grp.2	-	-	95.78 $\pm$ 0.29	95.00 $\pm$ 0.16	94.90 $\pm$ 0.50
grp.3	-	-	-	94.42 $\pm$ 2.17	94.25 $\pm$ 1.56
grp.4	-	-	-	-	96.50 $\pm$ 0.24
Average Confidence	95.50 $\pm$ 0.22	95.78 $\pm$ 0.24	95.35 $\pm$ 0.83	95.88 $\pm$ 0.58	95.55 $\pm$ 0.54

Table 8: Identity consistency of DRAS.

Age Groups	grp.1	grp.2	grp.3	grp.4	grp.5
grp.1	-	87.09 $\pm$ 4.66	89.73 $\pm$ 3.33	83.04 $\pm$ 3.39	77.27 $\pm$ 8.59
grp.2	-	-	90.60 $\pm$ 3.38	86.90 $\pm$ 2.24	82.15 $\pm$ 5.25
grp.3	-	-	-	89.46 $\pm$ 2.68	87.83 $\pm$ 2.85
grp.4	-	-	-	-	88.21 $\pm$ 3.16
Average Confidence	86.91 $\pm$ 3.99	88.83 $\pm$ 3.11	91.00 $\pm$ 2.25	89.00 $\pm$ 2.30	86.58 $\pm$ 4.00

Table 9: Age preservation performance of CAAE and DRAS. Bold numbers are the maximums.

Age Groups	Accuracy of CAAE (%)	Accuracy of DRAS (%)
0# (0-5)	99.94	99.93
1# (6-10)	97.59	99.63
2# (11-15)	79.03	92.47
3# (16-20)	84.83	93.12
4# (21-30)	99.91	100.00
5# (31-40)	97.27	93.90
6# (41-50)	75.64	84.71
7# (51-60)	90.53	90.10
8# (61-70)	96.65	97.02
9# (70+)	96.46	99.97
Average Accuracy (%)	91.78	96.04

Equations

$$\mathcal{L}_{age}=\|E_{A}(I_{j}^{n})-E_{A}(\tilde{I}_{i}^{n})\|_{2},$$

$$\mathcal{L}_{rec}=\|I_{i}^{m}-\tilde{I}_{i}^{m}\|_{1},$$

$$\mathcal{L}_{z_{I}}=\min_{E_{I}}\max_{D_{I}}\;\cdots$$

$$\mathcal{L}_{id}=\|E_{I}(I_{i}^{m})-E_{I}(\tilde{I}_{i}^{n})\|_{2}.$$

$$\mathcal{L}_{adv}=\min_{G}\max_{D}\;\cdots$$

$$\min_{E_{I},E_{A},G}\;\max_{D_{I},D}\;\cdots$$

Full text

This work was supported by the Jiangsu Overseas Visiting Scholar Program for University Prominent Young and Middle aged Teachers and Presidents.

Dual Reference Age Synthesis

Yuan Zhou, School of Artificial Intelligence, Nanjing University of Information Science and Technology, Nanjing, China

Bingzhang Hu, School of Computing, Newcastle University, UK

Jun He

Yu Guan

Ling Shao, Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi

keywords: age synthesis; dual reference; “soft” age information; conditional generative adversarial network

1 Introduction

Age synthesis, also known as face aging and rejuvenation or age progression and regression, aims to predict aging or rejuvenating effects on an individual’s facial images while preserving personalized features. It has received significant research interest in recent years due to its importance for a wide range of applications, e.g. finding missing people, face verification, security surveillance and entertainment. Before the emergence of the generative adversarial network (GAN) [1], popular age synthesis algorithms focused on shape or texture analysis, which is related to craniofacial growth or skin aging in age progression [2], or considered shape and texture synthesis simultaneously [3]. Since the breakthrough of GANs, GAN-based synthesis methods have made great progress.

Conventional age synthesis methods can be categorized into three groups: physical model based methods, prototype based methods and GAN-based methods. Physical model based methods use a parametric anatomical model to describe the face aging procedure, including how the facial skin changes, the physical mechanism of facial cranial growth, and facial muscle changes [4, 5, 6, 7, 8]. However, these methods are computationally expensive and complex [9, 10, 11]. In prototype based methods, prototypes are learned to define the salient features at different ages, and the age transformation is then depicted as the discrepancy between two prototypes [12, 13, 14]. This learned age transformation is applied to an input face to produce the corresponding aging effects. However, the prototypes are simply averages of facial features and are thus unable to preserve identity information [15]. GAN-based methods typically combine a GAN with an encoder for age synthesis, where the GAN is used to synthesize an image with the identity feature learned by the encoder [16, 17]. The age of the synthesized image can be controlled by transforming a fixed target age to a one-hot vector. For optimal performance, GAN-based methods require a huge volume of pair-wise images (i.e. facial images of the same person across a large age span), which are difficult and often infeasible to obtain.

Existing works use either a fixed number or descriptions like “young” and “old” to represent the age information desired in the output. However, people of the same age may appear to be different ages, and different observers have different understandings of “young” and “old”. This naturally raises the question: are current depictions adequate for accurately describing human age? Conventional methods use a number to control the age of a synthesized image, as shown in Figure 1(a). One drawback of this is that a single number does not fully capture human perceptions of age. As the old saying goes, “a picture is worth a thousand words”: a facial image provides far more age information than a number or an average depiction does. Thus, we propose a new task: can we use someone else’s image, at a specific age, to set the target age for face age synthesis? To tackle this task, we propose a novel framework in which an age reference image, in addition to the identity reference image, is input to reflect the target age. We refer to the proposed framework as dual-reference age synthesis (DRAS), as shown in Figure 1(b).

The contributions of this paper are three-fold:

1. Task: We propose a new task to synthesize images of one input face at a similar age to that of a second input image. The proposed task addresses the problem of a single number not being able to effectively represent human age.

2. Framework: A unified framework is proposed to tackle the new task. Using the mechanisms of two independent discriminators, the proposed framework can generate images with a similar age to the age reference image, while preserving identity information.

3. Performance: Extensive experiments and detailed analyses are conducted on two benchmark datasets: UTKFace and CACD. Our model achieves the best performance among compared methods for age synthesis, and is more feasible, especially for tasks lacking ground truth or pair-wise datasets.

2 Related Work

Physical models based on facial landmarks have been used for real-world age synthesis tasks since as far back as 2002. Lanitis et al. investigated three aging formulations and used 50 raw parameters to describe aging effects on facial appearance [18]. Mukaida and Ando [19] extracted and separated facial wrinkles and spots for age synthesis by analyzing the properties of pixel distributions in local areas. Gandhi et al. introduced a real-world age synthesis system [20] which mainly focused on texture synthesis by considering both signature images and regression-based age prediction. Ramanathan et al. defined a craniofacial growth model, including a shape aging model and a texture aging model, to characterize adult facial shape and textural variations occurring with age [21]. Fu and Zheng [22] presented the M-Face framework, in which shape caricaturing is integrated for the associated shape deformation. Since these models consider aging effects as a continuous procedure, i.e. from “young” to “old”, dense long-term face aging sequences are needed to obtain the “best” model. To tackle the lack of sufficient long-term face aging sequences for model learning, Suo et al. attempted to model facial muscle patterns from available short-term aging databases using a concatenational graph evolution aging model. Later on, they decomposed human faces into mutually interrelated sub-regions under anatomical guidance, and proposed an aging model that connects sequential short-term patterns following the Markov property of the aging process [23]. Recently, facial textures comprising skin texture details around facial meso-structures (e.g. eyes, nose and mouth) have been used to represent the aging effect, surpassing prior work [24].

Prototype model based methods developed almost in parallel to the physical model based methods. Rather than modeling continuous aging effects, as the physical model based methods do, prototype model methods divide the age range into discrete groups. Tiddeman et al. proposed wavelet-based methods for prototyping facial textures and shapes, and for artificially transforming the age of facial images [12]. Not focusing on facial shape or texture, a novel technique named image-based surface detail transfer (IBSDT) was proposed. In IBSDT, aging effects in a facial image are obtained by transferring the bumps from an old person’s skin surface to a young person’s face. Though IBSDT is simple to implement, markers must be added manually to the boundaries and the feature points [15]. These early prototype based methods used average facial shapes, textures and bumps to describe the aging transformation directly. Then, Kemelmacher-Shlizerman et al. proposed an illumination-aware age progression (IAAP) approach to compute average image subspaces, and used these average depictions (shape, texture) to yield an age-progressed result [14]. Thereafter, aiming to maintain personalized features, Shu et al. [25] proposed a coupled dictionary learning (CDL) method. In CDL, a dictionary is learnt for each age group and every two neighbouring dictionaries are learnt jointly. However, this method still has ghost artifacts, as the reconstruction residual does not evolve over time [9]. After that, Shu et al. proposed a Kinship-Guided Age Progression (KinGAP) approach, which can generate personalized aging images by computing average aged faces, taking the senior family members as prior guidance [26]. Bukar et al. proposed a novel algorithm [27, 28] that combines active appearance models (AAM) [29] with a face patch method to produce aging images with fine facial texture details, eliminating illumination differences. First, an invertible model of age synthesis is developed using AAM and sparse partial least squares regression (sPLS). Then the texture details of the face are enhanced using the patch-based synthesis approach.

Physical and prototype model based methods have dominated age synthesis in the last decade; however, their disadvantages, i.e. computational cost, complexity and missing facial details, have hindered high-quality synthesis. The conditional generative adversarial network (cGAN) [30] overcame these obstacles. cGAN introduces a condition into the original GAN to control the generated results, making end-to-end age synthesis possible [30]. A uniform noise $z$ and a condition $y$ are combined in a joint hidden representation and mapped to the data space as $\tilde{x}$ by a generator. $y$ is also fed into a discriminator as an additional input, together with either a real sample $x$ or a fake sample $\tilde{x}$. The discriminator tries to tell them apart, while the generator is trained to prevent this. Moreover, Makhzani et al. proposed the Adversarial Autoencoder (AAE), which can be used to learn identity features [31]. The controllable character of cGAN and the latent vector learning ability of AAE inspired GAN-based methods, which use an AAE to learn identity features and a cGAN to generate aged facial images [17, 32, 33]. Unlike physical and prototype model based methods, GAN-based methods use a number to represent age. To disentangle personality and age, Zhang et al. proposed a conditional adversarial auto-encoder (CAAE), which includes an encoder and a GAN [10]. Personal identities were determined by mapping the original face image to a latent vector via the encoder; these identities and a corresponding numeral (age) were then fed into the GAN to synthesize facial images. Antipov et al. proposed an Age Conditional Generative Adversarial Network (Age-GAN), which used FaceNet to optimize latent identity vectors [17]. Age-GAN can be considered a type of CAAE. Recently, focusing on identity preservation, Wang et al. proposed identity-preserving conditional generative adversarial networks (IPCGANs), which use an age classifier to force the generated face to be within the target age group [11]. To obtain more realistic images, Li et al. proposed a Wavelet-domain Global and Local Consistent Age Generative Adversarial Network (WaveletGLCA-GAN). On the other hand, there are GAN-based methods which do not explicitly model face aging synthesis but are feasible for the task, e.g. the Expression Generative Adversarial Network (ExprGAN) [34] and StarGAN [35]. To the best of our knowledge, these methods still use a specific number to describe age group information and require pair-wise or annotated data. It is worth noting that our work is closely related to IP-GANs [36], which use a Gaussian distribution to regularize the features of all attributes. However, age does not follow a Gaussian distribution, which means IP-GANs are not a suitable choice for age synthesis.
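To make the “hard” age conditioning used by these cGAN-style methods concrete, the following is a minimal sketch of a one-hot age condition joined to a noise vector. It is an illustration under stated assumptions rather than any cited method's implementation: the helper names `one_hot` and `joint_input`, the noise length of 50, and the use of plain concatenation as the "joint hidden representation" are all hypothetical choices.

```python
import random

NUM_AGE_GROUPS = 10  # ten discrete age groups, as in the experiments of this paper


def one_hot(group_index, num_groups=NUM_AGE_GROUPS):
    """Encode a discrete age group as a one-hot condition vector y."""
    if not 0 <= group_index < num_groups:
        raise ValueError("age group index out of range")
    return [1.0 if k == group_index else 0.0 for k in range(num_groups)]


def joint_input(z, y):
    """Join noise z and condition y by concatenation (an assumed, simple choice)."""
    return list(z) + list(y)


rng = random.Random(0)
z = [rng.uniform(-1.0, 1.0) for _ in range(50)]  # uniform noise; length 50 is illustrative
y = one_hot(4)                                   # "hard" age: group 4 (ages 21-30)
x_in = joint_input(z, y)                         # vector a generator would consume
```

The condition carries only the group index, which is the limitation the dual-reference formulation is meant to address.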

3 Proposed method

In this section, we first describe the framework of our proposed method. Two main modules of the framework are discussed in Sec. 3.2 and Sec. 3.3, respectively. Finally, the objective functions are introduced.

3.1 Overview

Given an arbitrary image of a person, can you imagine what he/she will (did) look like in the future (past)? Figure 2 shows an example of the age progression/regression results. The input images (with black dotted boxes) are manipulated into “child”, “young” and “old”.

Different people of the same age often have different age appearances. Therefore, rather than providing a fixed numerical “hard” age, it is more reasonable to refer to the “soft” version of age information extracted from a facial image via a deep encoder network. Our DRAS framework consists of three parts: an age agent, an identity agent and a GAN. The age and identity features are learned by means of the age and identity agents, respectively. The GAN is used to synthesize photo-realistic facial images.

Figure 3 describes the framework of our proposed method. For convenience, we define $I_{i}^{m}$ as the identity reference image of the individual with identity $i$ at age $m$, and $I_{j}^{n}$ as the age reference image of the individual with identity $j$ at age $n$. We assume that the face image is sampled from two low-dimensional manifolds, the age manifold and the identity manifold, where the identity and age change smoothly along their respective dimensions. The two raw reference images are first projected onto the identity and age manifolds, respectively, via the identity agent $E_{I}$ and the age agent $E_{A}$. Subsequently, the identity and age features are sampled from these two manifolds, respectively. Moreover, a discriminator $D_{I}$ is coupled with the identity agent to ensure that the identity features follow a uniform distribution. Then, the identity and age features form a joint feature, which is fed into the generator. Finally, the generator synthesizes a facial image which shares the same identity feature as the identity reference image and the same age feature as the age reference image. A hybrid loss function is used to optimize our model. It includes five losses: a reconstruction loss $\mathcal{L}_{rec}$, two adversarial losses $\mathcal{L}_{z_{I}}$ and $\mathcal{L}_{adv}$, and two preservation functions $\mathcal{L}_{id}$ and $\mathcal{L}_{age}$. For detailed information, please refer to the following discussions.
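The data flow described above can be sketched as follows. This is a toy illustration, not the authors' networks: `toy_identity_agent`, `toy_age_agent` and `toy_generator` are hypothetical stand-ins for $E_{I}$, $E_{A}$ and $G$, images are flat lists of numbers, and concatenation is an assumed way of forming the joint feature.

```python
def toy_identity_agent(image):
    """Hypothetical stand-in for E_I: keeps the first two values as the 'identity' feature."""
    return [image[0], image[1]]


def toy_age_agent(image):
    """Hypothetical stand-in for E_A: uses the mean value as the 'age' feature."""
    return [sum(image) / len(image)]


def toy_generator(joint_feature):
    """Hypothetical stand-in for G: maps the joint feature to an output 'image'."""
    return [0.5 * v for v in joint_feature]


def synthesize(identity_ref, age_ref, E_I, E_A, G):
    """Encode each reference with its own agent, join the features, and generate."""
    z_id = E_I(identity_ref)            # identity feature from the identity reference
    z_age = E_A(age_ref)                # age feature from the age reference
    joint = list(z_id) + list(z_age)    # joint feature (concatenation is an assumption)
    return G(joint)


identity_ref = [0.2, -0.4, 0.6, 0.8]
age_ref = [1.0, 1.0, -1.0, -1.0]
output = synthesize(identity_ref, age_ref,
                    toy_identity_agent, toy_age_agent, toy_generator)
```

Passing the same image as both references gives the reconstruction setting, in which the output is compared with the input itself.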

3.2 Age Agent

An age agent is designed for the proposed framework based on the deep expectation of apparent age (DEX) [37, 38], pretrained on ImageNet [39]. The last fully-connected layer of the original DEX is removed and two new fully-connected layers are introduced. The sizes of these two new layers are 1024 and 50, and the 50-dimensional output of the age agent is the final age feature. Furthermore, the input age reference image must be of size $224\times 224$.

Age preservation: The image generated by our DRAS should have a similar age to the age reference image, which means the age feature difference between the two should be as small as possible. Here, we use the age preservation loss to describe the similarity of the age features, as given in Equation (1).

$$\mathcal{L}_{age}=\|E_{A}(I_{j}^{n})-E_{A}(\tilde{I}_{i}^{n})\|_{2},\tag{1}$$

where $E_{A}(\cdot)$ is the age feature, and $\tilde{I}_{i}^{n}$ is the synthesized image.
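As a minimal sketch of the age preservation loss, the L2 distance between two age feature vectors can be computed as below. The feature values are made up for illustration, and the function names are hypothetical; only the L2 form and the 50-dimensional feature size come from the text.

```python
import math


def l2_distance(u, v):
    """Euclidean (L2) distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def age_preservation_loss(age_feat_reference, age_feat_synthesized):
    """L_age: distance between E_A(age reference) and E_A(synthesized image)."""
    return l2_distance(age_feat_reference, age_feat_synthesized)


# Illustrative 50-dimensional age features (values are invented for the example).
feat_ref = [0.0] * 50
feat_syn = [0.0] * 48 + [3.0, 4.0]
loss = age_preservation_loss(feat_ref, feat_syn)  # distance of a 3-4-5 triangle
```

The identity preservation loss in Sec. 3.3 has the same L2 form, with identity features from $E_{I}$ in place of age features.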

In contrast to conventional methods, by introducing the age preservation loss, the age agent is trained in an unsupervised manner: no age annotation is needed, only the age feature $E_{A}(I_{j}^{n})$ of the reference image, which serves as the ground truth. When training our DRAS, the parameters of the last two fully-connected layers are optimized to better learn age features through back-propagation of the age preservation loss. Moreover, using the 50-dimensional feature rather than the conception features (congregated multi-layer outputs of deep networks) makes our framework light-weight [16, 34].

3.3 Identity Agent

The identity agent consists of an encoder $E_{I}$ and a discriminator $D_{I}$ , whose architectures are adapted from [10]. The encoder takes a $128\times 128\times 3$ image as input.

Reconstruction: In order to extract identity features from identity reference images without pair-wise or labeled training data, a reconstruction loss is used:

$$\mathcal{L}_{rec}=\|I_{i}^{m}-\tilde{I}_{i}^{m}\|_{1},\tag{2}$$

where $\tilde{I}_{i}^{m}=G(E_{I}(I_{i}^{m}),E_{A}(I_{i}^{m}))$ . A smaller reconstruction loss value intuitively means the reconstructed image is more similar to the original image at a pixel level. In other words, $\tilde{I}_{i}^{m}$ is the synthesized image at the same age as the identity reference image $I_{i}^{m}$ . Furthermore, if the reconstruction loss value is zero, $\tilde{I}_{i}^{m}$ is $I_{i}^{m}$ exactly. Since we do not have any identity information about training data, the original image is used as the ground truth for adversarial training.
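The reconstruction loss can be illustrated with a small pixel-level example. This sketch assumes the L1 norm is taken as a plain sum of absolute differences (whether the authors sum or average over pixels is not stated), and it represents images as nested lists of already-normalized values.

```python
def l1_reconstruction_loss(original, reconstructed):
    """Sum of absolute pixel differences between an image and its reconstruction."""
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of rows")
    total = 0.0
    for row_o, row_r in zip(original, reconstructed):
        if len(row_o) != len(row_r):
            raise ValueError("image rows must have the same length")
        total += sum(abs(a - b) for a, b in zip(row_o, row_r))
    return total


# Tiny 2x2 single-channel "images" with invented values in [-1, 1].
original = [[0.5, -0.5], [1.0, 0.0]]
reconstructed = [[0.5, 0.0], [0.5, 0.0]]
loss = l1_reconstruction_loss(original, reconstructed)
```

A perfect reconstruction gives a loss of exactly zero, matching the remark above that $\tilde{I}_{i}^{m}$ then equals $I_{i}^{m}$.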

Following [10], the identity feature is assumed to follow a uniform distribution, so the adversarial process forces the estimated identity manifold to cover the identity distribution as well as possible. Denoting by $p_{data}(I)$ the distribution of the identity reference data $I$ and by $p_{z}$ the prior uniform distribution of the identity feature $z_{I}$, the identity feature is trained to approximate a uniform distribution by:

$$\mathcal{L}_{z_{I}}=\min_{E_{I}}\max_{D_{I}}\;\cdots\tag{3}$$

where $E_{I}(\cdot)$ is the identity feature, and $E_{I}$ and $D_{I}$ denote the identity encoder and identity discriminator.
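For orientation, an adversarial objective of this kind is commonly written in the adversarial autoencoder form shown below. This is a sketch that assumes the standard formulation from the AAE and CAAE literature, built from the symbols $p_{z}$, $p_{data}(I)$, $E_{I}$ and $D_{I}$ defined above; it is not a verbatim copy of the authors' Equation (3).

```latex
\mathcal{L}_{z_{I}}
  = \min_{E_{I}} \max_{D_{I}} \;
    \mathbb{E}_{z_{I} \sim p_{z}} \big[ \log D_{I}(z_{I}) \big]
  + \mathbb{E}_{I \sim p_{data}(I)} \big[ \log \big( 1 - D_{I}(E_{I}(I)) \big) \big]
```

Under this form, $D_{I}$ learns to separate samples of the uniform prior from encoded identity features, while $E_{I}$ learns to make the two indistinguishable.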

Identity preservation: To further guarantee that the synthesized images preserve the identity information, we introduce the identity preservation loss into the identity agent:

$$\mathcal{L}_{id}=\|E_{I}(I_{i}^{m})-E_{I}(\tilde{I}_{i}^{n})\|_{2}.\tag{4}$$

The identity preservation loss enhances the identity feature learning ability. Synthesized images of the same identity at different ages are given the same identity information by minimizing $\mathcal{L}_{id}$ , which disentangles the identity feature from the age feature.

3.4 Generator and Discriminator

Following the work in [10], the generator $G$ and the image discriminator $D$ have the same architecture as in CAAE, except for the input: the input of DRAS consists of two images, one for identity reference and the other for age reference, while CAAE requires an image for identity reference and a number for age reference.

To generate a photo-realistic face image, the discriminator tries to classify the two reference images as real and the generated image as fake. Thus, the adversarial loss function can be derived as:

$$\mathcal{L}_{adv}=\min_{G}\max_{D}\;\cdots\tag{5}$$

where $\tilde{I}_{i}^{n}=G(E_{I}({I}_{i}^{m}),E_{A}({I}_{j}^{n}))$ . As with the original GANs, the generator and discriminator are alternately optimized via the adversarial loss.
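The following sketch shows how such an adversarial value could be evaluated from discriminator outputs. It assumes the standard GAN log-likelihood form with both reference images treated as real and the synthesized image as fake; the exact terms and weighting of the authors' Equation (5) are not recoverable here, and the function name and the `eps` safeguard are hypothetical.

```python
import math


def adversarial_value(d_identity_ref, d_age_ref, d_generated, eps=1e-12):
    """Value that D tries to maximize and G tries to minimize (assumed standard form).

    Each argument is the discriminator's estimated probability, in [0, 1],
    that the corresponding image is real.
    """
    for p in (d_identity_ref, d_age_ref, d_generated):
        if not 0.0 <= p <= 1.0:
            raise ValueError("discriminator outputs must be probabilities")
    return (math.log(d_identity_ref + eps)        # identity reference scored as real
            + math.log(d_age_ref + eps)           # age reference scored as real
            + math.log(1.0 - d_generated + eps))  # generated image scored as fake


confident = adversarial_value(0.9, 0.9, 0.1)  # discriminator separates real from fake
confused = adversarial_value(0.5, 0.5, 0.5)   # discriminator cannot tell them apart
```

The value is higher when the discriminator separates real from generated images, so the generator improves by pushing it toward the confused case.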

3.5 Objective Function

To guarantee the performance of our model, a hybrid loss function is constructed, which consists of the identity feature preservation loss, the age feature preservation loss and two adversarial losses. Equation (6) shows the overall objective function:

$$\min_{E_{I},E_{A},G}\;\max_{D_{I},D}\;\cdots\tag{6}$$

where $\lambda_{adv}$ , $\lambda_{id}$ and $\lambda_{age}$ are weights to control the impact of these loss terms.

The identity agent, the age agent and the generator are optimized by minimizing Equation (6), and the discriminators are optimized by maximizing Equation (6).
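As a rough illustration of how the weighted terms might be combined, the sketch below sums the five losses named in the overview. The unit weights on $\mathcal{L}_{rec}$ and $\mathcal{L}_{z_{I}}$ are an assumption (the text names only $\lambda_{adv}$, $\lambda_{id}$ and $\lambda_{age}$), and the default weight values are those reported in the implementation details of Sec. 4.2.

```python
def hybrid_loss(l_rec, l_z, l_adv, l_id, l_age,
                lambda_adv=1.0, lambda_id=1e-3, lambda_age=1e-2):
    """Weighted sum of the five loss terms.

    l_rec and l_z enter with unit weight, which is an assumption of this sketch;
    the lambda defaults are the values reported in Sec. 4.2.
    """
    return (l_rec
            + l_z
            + lambda_adv * l_adv
            + lambda_id * l_id
            + lambda_age * l_age)


# With every term equal to 1.0 the result exposes the effective weights.
total = hybrid_loss(1.0, 1.0, 1.0, 1.0, 1.0)
```

With these defaults the preservation terms are scaled down by two to three orders of magnitude relative to the reconstruction and adversarial terms.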

4 Experiments

4.1 Data Description

We conduct experiments on two widely used benchmark face datasets: UTKFace [10] (https://susanqq.github.io/UTKFace/) and the Cross-Age Celebrity Dataset (CACD) [40] (http://bcsiriuschen.github.io/CARC/). There are over 20,000 facial images without identity annotations in the UTKFace dataset, and 2,000 celebrities in the CACD dataset. Note that though images in UTKFace are in-the-wild, most are of good quality. However, images in CACD with rank higher than five are “low quality”: some have wrong identity or age labels, so they cannot be used to verify identity preservation and age preservation, and some are not even photos of real people, so they cannot serve as reference images for our model. Therefore, we choose only those images with rank smaller than or equal to five [40]. Images are divided into ten age groups according to their age annotations (real ages). Figure 5 shows the age distributions. Only UTKFace includes babies (zero to five years old), children (six to ten years old) and senior people (above 70 years old). The number of people between 20 and 40 years old is about twice that of the other age groups. In terms of morphology, children (under ten years old) have different facial appearances from teenagers and adults, e.g. different widths between their eyes and different face shapes. In order to avoid over-fitting or under-fitting, we augment UTKFace and CACD by flipping images of babies, children and seniors.
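The grouping and flip augmentation described above can be sketched as follows. The group boundaries are taken from the age-group rows of the result tables (0-5, 6-10, 11-15, 16-20, 21-30, 31-40, 41-50, 51-60, 61-70, 70+). The function names are hypothetical, images are nested lists, horizontal flipping is an assumed reading of “flipping”, and the choice of which groups count as babies, children and seniors follows the age ranges given in the text.

```python
import bisect

# Inclusive upper bounds of groups 0-8; ages above 70 fall into group 9.
GROUP_UPPER_BOUNDS = [5, 10, 15, 20, 30, 40, 50, 60, 70]

# Groups augmented by flipping: babies (0-5), children (6-10), seniors (above 70).
AUGMENTED_GROUPS = {0, 1, 9}


def age_group(age):
    """Map an age in years to one of the ten age-group indices."""
    if age < 0:
        raise ValueError("age must be non-negative")
    return bisect.bisect_left(GROUP_UPPER_BOUNDS, age)


def horizontal_flip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]


def augment(samples):
    """Return the samples plus flipped copies for the under-represented groups."""
    augmented = []
    for image, age in samples:
        augmented.append((image, age))
        if age_group(age) in AUGMENTED_GROUPS:
            augmented.append((horizontal_flip(image), age))
    return augmented


samples = [([[1, 2], [3, 4]], 3), ([[5, 6], [7, 8]], 25)]
augmented = augment(samples)  # the 3-year-old's image gains a flipped copy
```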

4.2 Implementation Details

80% of the images are used as training data, 10% as validation data and the remaining 10% as test data. All images are aligned, cropped and normalized to [-1,1]. The identity and age features are also normalized to [-1,1], to be consistent with the reference images. We train our model on an NVIDIA TITAN X GPU with a decreasing learning rate (the default learning rate is $2\times 10^{-3}$). We use a mini-batch size of 100, and set $\lambda_{adv}$ to 1, $\lambda_{id}$ to $1\times 10^{-3}$ and $\lambda_{age}$ to $1\times 10^{-2}$.
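A minimal sketch of the data split and pixel normalization is given below. It assumes a random shuffle before splitting and 8-bit input pixels in [0, 255]; neither detail is stated in the text, and the function names and the seed are hypothetical.

```python
import random


def split_dataset(items, seed=0):
    """Shuffle and split items into 80% train, 10% validation, 10% test."""
    items = list(items)
    random.Random(seed).shuffle(items)   # random shuffle is an assumption
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


def normalize_pixels(image):
    """Map pixel values from [0, 255] (assumed 8-bit input) to [-1, 1]."""
    return [[p / 127.5 - 1.0 for p in row] for row in image]


train, val, test = split_dataset(range(100))
normalized = normalize_pixels([[0, 255], [51, 204]])
```

Integer arithmetic is used for the split sizes so that the three parts always cover the dataset exactly once.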

Unlike other typical methods, such as [41], DRAS takes two images as inputs and does not need any annotations of identity or age. Therefore, in the training step, we randomly choose one image as the identity reference image and another as the age reference image. It is worth noting that both reference images are sampled randomly from the same training dataset, which means that an identity reference image can also be an age reference image and vice versa. The reconstruction loss $\mathcal{L}_{rec}$ and the adversarial loss $\mathcal{L}_{z_{I}}$ are related only to the identity reference image. Furthermore, our model will still work if the identity reference image is replaced by the age reference image in these two loss functions.

Empirically, it is difficult to achieve good performance if we train the model with the hybrid loss function directly. Thus, we apply a joint-training strategy. First, in order to learn the age and identity information and ensure that the approximated identity manifold covers the whole feature space, we set $I_{i}^{m}=I_{j}^{n}$ to reconstruct $I_{i}^{m}$. In this reconstruction stage, the identity agent is trained with the reconstruction loss $\mathcal{L}_{rec}$ and the adversarial loss $\mathcal{L}_{z_{I}}$, and the age agent is trained with the reconstruction loss $\mathcal{L}_{rec}$ and the age preservation loss $\mathcal{L}_{age}$. Furthermore, to guarantee that the generated images are photo-realistic, the discriminator $D$ and the generator $G$ are trained alternately with the other adversarial loss $\mathcal{L}_{adv}$. Subsequently, after the losses of the identity agent and the age agent converge, we fix $E_{I}$, $E_{A}$ and $D_{I}$, set $I_{i}^{m}\neq I_{j}^{n}$, and use the two preservation functions $\mathcal{L}_{id}$ and $\mathcal{L}_{age}$ to optimize the generator and discriminator.
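The two training stages differ mainly in how reference pairs are drawn and which losses are active, which can be sketched as below. The stage names, the `ACTIVE_LOSSES` table and `sample_reference_pair` are hypothetical; the loss assignments follow the description above, and "different images" is interpreted here as different dataset indices.

```python
import random

# Which losses train which component in each stage, as described in the text.
ACTIVE_LOSSES = {
    "reconstruction": {
        "identity_agent": ["L_rec", "L_zI"],
        "age_agent": ["L_rec", "L_age"],
        "generator_and_discriminator": ["L_adv"],
    },
    # E_I, E_A and D_I are fixed in this stage.
    "synthesis": {
        "generator_and_discriminator": ["L_id", "L_age"],
    },
}


def sample_reference_pair(dataset_size, stage, rng):
    """Return (identity_index, age_index) for one training sample."""
    i = rng.randrange(dataset_size)
    if stage == "reconstruction":
        return i, i                      # same image serves as both references
    if stage == "synthesis":
        if dataset_size < 2:
            raise ValueError("need at least two images to draw a distinct pair")
        j = rng.randrange(dataset_size - 1)
        if j >= i:
            j += 1                       # skip index i so the two references differ
        return i, j
    raise ValueError("unknown stage: " + stage)


rng = random.Random(0)
recon_pair = sample_reference_pair(10, "reconstruction", rng)
synth_pair = sample_reference_pair(10, "synthesis", rng)
```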

4.3 Experimental Performance and Analysis

In this section, we first investigate the performance of our model, then select two baselines, CAAE [10] and IPCGAN [11], for comparison. Since conventional GAN-based methods use ten age groups to investigate different age effects, for fair comparison, we randomly choose one image from each group as the age reference images. Note that these ten age reference images are distinct from the training images to avoid over-fitting, as shown in Figure 5. As can be seen, the men in Figure 5(f), (g) and (h) are from different age groups according to their age annotations, but they look as old as each other.

4.3.1 Performance Evaluation of Disentangled Identity Feature Learning

A disentangled identity feature here means that the identity features of different people should be isolated from each other, regardless of whether or not they are in the same age group. T-Distributed Stochastic Neighbor Embedding (t-SNE) [42, 43] depicts the similarities of identity features and can be used to visualize their disentangled representations. t-SNE maps similar identity features to nearby points and dissimilar ones to distant points. Since the data in CACD have identity annotations, we evaluate the disentangled feature learning performance on this dataset. Images of 11 celebrities at different ages are collected from CACD, as described in Table 1.

Nine subsets from Table 1 are chosen and divided into three groups, {[$id0$, $id1$, $id2$], [$id4$, $id5$, $id6$], [$id8$, $id9$, $id10$]}, such that the people within each group are of similar age. To examine the performance of disentangled identity feature learning on the three groups, identity features of the nine people are retrieved using the identity agent for visualization, as shown in Figure 6. The nine people are almost entirely isolated from each other. However, since pose, expression, face shape, etc. are also reflected in the identity feature, some people with similar poses, expressions or face shapes overlap. For example, $id0$ and $id1$ have profiles in Figure 6(a), $id5$ and $id6$ have similar smiling and serious expressions in Figure 6(b), and $id8$ and $id10$ both have long face shapes in Figure 6(c).

Moreover, we randomly select six individuals from Table 1 to study the identity features of different people at various ages. In Figure 7, most samples are correctly isolated in the identity feature space. However, there are still some overlapping points across different identities. To explore this further, we plot the images corresponding to those overlapping points. Some of the overlaps are caused by similar makeup, e.g. $id6$ and $id10$ , and some by similar expressions, e.g. $id9$ and $id10$ . It is also interesting to note that these overlapping points indirectly support the manifold assumption.

In these two experiments, most of the identity features of different people fall into different clusters, regardless of whether or not they are the same age, validating the disentangled feature learning ability of the identity agent.

4.3.2 Ablation Study

To validate the effects of the identity and age preservation losses, we design four ablation models, abbreviated $M1$ , $M2$ , $M3$ and $M4$ , as shown in Table 2. The images generated by the four models are shown in Figure 8.

In Figure 8, some images generated by $M2$ and $M3$ look younger than their age reference images, some images generated by $M2$ look male although the person is actually female, and some images generated by $M4$ have artifacts in local facial parts. The younger appearance in $M2$ and $M3$ is mainly caused by the lack of the age preservation loss, and the incorrect male appearance in $M2$ is mainly caused by the lack of the identity preservation loss. The undesired artifacts in $M4$ arise because only the age preservation loss is considered, which encourages the generator to produce more age-related features, such as wrinkles at the eye corners or the wide eyes of children, and these can look like artifacts to the human eye.

Effect of Identity Preservation To intuitively evaluate the effect of the identity preservation loss in the feature space, we also use t-SNE to visualize the identity features of the identity reference images and the generated images. Figure 9 shows the identity features of different people of the same age and Figure 10 shows the identity features of different people at different ages. Images in the first row ( $M1$ ) have the best performance in terms of in-cluster compactness and between-cluster separation, which demonstrates that the two preservation functions in DRAS can effectively conserve identity features and perform disentangling. Note that the generated images have slight feature shifts from their identity reference images, which are caused by the compromise between the identity and age preservation losses. There are more overlapping points and larger feature shifts in the second and fourth rows ( $M2$ and $M4$ ), caused by the lack of the identity preservation loss. In the third row ( $M3$ ), the identity features of the generated images tend to overlap with those of their identity reference images, which confirms that the identity preservation loss acts in the identity feature space.
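The notions of in-cluster compactness and between-cluster separation used above can also be expressed numerically. The following is a minimal sketch under our own assumptions: the two measures (mean distance to the centroid, minimum centroid-to-centroid distance) and the toy 2-D points are illustrative choices, not metrics reported in the paper.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def compactness(points):
    """Mean distance from each point to its cluster centroid (lower is tighter)."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)

def separation(clusters):
    """Smallest distance between the centroids of any two clusters (higher is better)."""
    cents = [centroid(pts) for pts in clusters.values()]
    return min(math.dist(a, b) for i, a in enumerate(cents) for b in cents[i + 1:])

# Toy identity features for two hypothetical people.
clusters = {
    "id0": [(0.0, 0.0), (0.0, 2.0)],
    "id1": [(10.0, 0.0), (10.0, 2.0)],
}
tightness = {name: compactness(pts) for name, pts in clusters.items()}
gap = separation(clusters)
```

Such measures are best computed on the identity features themselves; distances between points in a t-SNE plot are mainly useful for visual inspection.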

We also use the online face comparator provided by Face++ to quantify the effect of the identity preservation loss [44]. The confidence threshold is set to 73.975, and a higher confidence means a higher likelihood that two face images show the same person. First, we compare each test image with its corresponding synthesized images. The comparative results in Table 3 indicate that the models with the identity preservation function, i.e. $M1$ and $M3$ , generally obtain higher verification confidence. The confidence of $M1$ is lower than that of $M3$ , which again reflects the compromise between the identity and age preservation losses; for example, wrinkles on the face are not rendered clearly and make the face look dirty. Note that the verification confidence gaps among the four models are small, which indicates that the reconstruction loss plays an important role in identity preservation. Furthermore, to investigate the consistency of identity, we divide the synthesized images into 10 age groups and conduct the comparison between these age groups, i.e. [age group $i$ , age group $j$ ] ( $i\neq j$ ). The average confidences in Figure 11 show that model $M1$ retains more identity consistency as the face ages.
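The cross-age-group consistency check can be sketched as below. This is only an illustration: `compare` is a hypothetical stand-in for a face comparator returning a confidence score, `toy_compare` and the numeric "images" are invented for the example, and no real Face++ call is shown. The sketch visits each unordered pair of groups once, which gives the same average as all ordered pairs $i\neq j$ when the comparator is symmetric (an assumption).

```python
from itertools import combinations
from statistics import mean

THRESHOLD = 73.975  # verification threshold quoted in the text

def cross_group_consistency(images_by_group, compare):
    """Average confidence and pass rate over image pairs from different age groups."""
    scores = [
        compare(a, b)
        for gi, gj in combinations(sorted(images_by_group), 2)
        for a in images_by_group[gi]
        for b in images_by_group[gj]
    ]
    pass_rate = sum(s > THRESHOLD for s in scores) / len(scores)
    return mean(scores), pass_rate

def toy_compare(a, b):
    """Hypothetical comparator: confidence drops with the gap between two numbers."""
    return 100.0 - abs(a - b)

# Each "image" is a single number here, purely so the example runs.
toy_images = {0: [10.0], 1: [20.0], 2: [40.0]}
avg_conf, pass_rate = cross_group_consistency(toy_images, toy_compare)
```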

Based on the visual and quantitative evaluations, we conclude that the identity preservation function enhances the model's ability to preserve identity and to keep it consistent across ages.

Effect of Age Preservation To discern whether a synthesized image has the same age feature as the age reference image, we use a pre-trained AlexNet model fine-tuned on UTKFace and CACD. Since each test image has ten synthesized images, the data is balanced, so we can use accuracy to describe age preservation performance: a higher accuracy suggests that more generated images have the same age features as their age reference images.

Age similarities in the $M1$ , $M2$ , $M3$ and $M4$ models are measured and the results are shown in Table 4. DRAS with the age preservation function ( $M1$ and $M4$ ) obtains higher accuracy than DRAS without it. Model $M1$ synthesizes images using both the age preservation loss and the identity preservation loss, which leads to a compromise between identity accuracy and age accuracy. Therefore, the average performance of DRAS using only the age preservation function is the highest.
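The accuracy measure described above reduces to comparing the age group predicted for each synthesized image with the group of its age reference. A minimal sketch follows; the model names and toy predictions are hypothetical and are not results from Table 4.

```python
def accuracy(predicted, reference):
    """Fraction of synthesized images whose predicted age group matches the reference."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == r for p, r in zip(predicted, reference)) / len(predicted)

def accuracy_per_model(predictions_by_model, reference):
    """Apply the same accuracy measure to every model's predictions."""
    return {m: accuracy(p, reference) for m, p in predictions_by_model.items()}

# Toy example: age groups of four age reference images and the groups an
# age classifier assigns to the images synthesized by two hypothetical models.
reference_groups = [0, 1, 2, 3]
toy_predictions = {
    "with_age_loss": [0, 1, 2, 3],
    "without_age_loss": [0, 1, 1, 2],
}
scores = accuracy_per_model(toy_predictions, reference_groups)
```

Plain accuracy is only informative here because every age group contributes the same number of synthesized images.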

4.3.3 Generative Performance Comparison

In this experiment, the performances of DRAS, CAAE and IPCGAN are compared in terms of their generated images. We use the code published online by the authors of CAAE (https://github.com/ZZUTK/Face-Aging-CAAE) and IPCGAN (https://github.com/dawei6875797/Face-Aging-with-Identity-Preserved-Conditional-Generative-Adversarial-Networks), with the same configurations as in the original papers. For a fair comparison with the two baselines, we adopt the following experimental protocol: for the baseline CAAE, since its released model is trained on UTKFace, we replicate the experimental results on UTKFace and fine-tune the model on CACD; for the other baseline, IPCGAN, since its released model is trained on CACD, we replicate the experimental results on CACD and fine-tune the model on UTKFace. As the released code of IPCGAN was trained to synthesize images in five age groups (11-20, 21-30, 31-40, 41-50 and 50+), we use the same categories to generate images.
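For reference, the five age categories above can be written as a small mapping from age to group index. This is a sketch under stated assumptions: the zero-based indices and the function name `age_to_group` are ours, and since age 50 already belongs to 41-50, "50+" is read here as ages above 50.

```python
def age_to_group(age):
    """Map an age in years to one of the five groups 11-20, 21-30, 31-40, 41-50, 50+.

    Returns an index 0..4, or None for ages below 11, which fall outside all groups.
    """
    if age < 11:
        return None
    if age > 50:
        return 4
    # Ages 11-20 give 0, 21-30 give 1, 31-40 give 2, 41-50 give 3.
    return (age - 11) // 10

groups = [age_to_group(a) for a in (8, 11, 20, 21, 40, 50, 51, 70)]
```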

First, images are generated by the three respective methods, taking Figure 5(d)-(h) as their age reference images. In Figure 12, from left to right, the facial images generated by DRAS get older as their age reference images do. The generated images retain the age features of the reference images, e.g. round cheeks and bigger eyes for younger images. Regarding pose and expression, the generated images have the same identity features as their identity reference images. From Figure 12, we can also see that the age effects of the images generated by IPCGAN change only slightly, while the synthesized images of CAAE look blurry and have artifacts. For IPCGAN, the slight aging effect arises because it cannot effectively isolate the identity and age features from each other, since the identity features share part of the convolutional network with the age classifier. For CAAE, the undesirable artifacts are inevitable because it only uses the reconstruction loss to preserve identity features, without any constraint on age preservation. Thus, compared with CAAE and IPCGAN, our model can generate higher quality facial images whose identities and ages are consistent with their reference images.

The images generated for the five age groups do not have corresponding ground truths. However, if both the identity and age information come from the identity reference image, then the image is equivalent to the ground truth. Therefore, for the dual reference images, we use one image as both the identity and age reference. The generative performance can also be evaluated by comparing the generated images with their ground truth, which ideally should be the same. As can be seen in Figure 13, the images generated by DRAS look similar to or even clearer than their ground truth. For CAAE and IPCGAN, images with red boxes are those that look different from their ground truths or are blurry in local facial parts.
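The self-reference protocol above, in which one image serves as both the identity and the age reference, can be sketched as a reconstruction check. Note the assumptions: the paper compares the images visually, whereas the pixel-wise mean absolute error used here is our own illustrative choice, and `toy_generate` is a hypothetical stand-in for a trained generator.

```python
def mean_abs_error(img_a, img_b):
    """Pixel-wise mean absolute error between two images given as nested lists."""
    flat_a = [v for row in img_a for v in row]
    flat_b = [v for row in img_b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(x - y) for x, y in zip(flat_a, flat_b)) / len(flat_a)

def self_reference_error(generate, image):
    """Use one image as both references and compare the output with that image."""
    return mean_abs_error(generate(image, image), image)

def toy_generate(identity_ref, age_ref):
    """Hypothetical generator stand-in: pixel-wise average of the two references."""
    return [[(a + b) / 2 for a, b in zip(ra, rb)]
            for ra, rb in zip(identity_ref, age_ref)]

image = [[0.0, 0.5], [1.0, 0.25]]
err = self_reference_error(toy_generate, image)
```

An ideal generator yields an error of zero under this protocol; larger values indicate that identity or age information is lost even when nothing should change.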

Furthermore, facial images of Emma Watson and Isabella Rossellini were generated using different age reference images. Real images at the same age as the reference images were taken as the ground truth. As shown in Figure 14, the results generated by DRAS are the most photo-realistic and reasonable. The cheeks in the first synthesized image of Emma Watson are just like those of the age reference image. The first two synthesized images of Isabella Rossellini look the same age, just as their age reference images do. In contrast, the images generated by IPCGAN all look the same as their identity reference images, with no visible aging effect. For example, from left to right, the first age reference image looks much younger than the third one, yet the third synthesized image of Emma Watson looks just as young as the first. For CAAE, the third synthesized image looks male, which is not acceptable.

Overall, in these three experiments, the generative performance of DRAS is clearly higher than that of the other two methods.

4.3.4 Identity Preservation Comparison

In this experiment, to compare the identity preservation capability of our model with that of the other two methods, t-SNE is again used to visualize the synthesized images in the feature space. Figure 15 and Figure 16 show that most synthesized images are close to or even overlap with their identity reference images in the identity feature space. This is because all three methods have an identity preservation strategy: the identity preservation and reconstruction losses in DRAS, the conceptual and reconstruction losses in IPCGAN, and the reconstruction loss in CAAE.

The identity features of DRAS and IPCGAN show more intra-cluster compactness than those of CAAE, which suggests that using only the reconstruction loss does not guarantee identity consistency. Moreover, the identity features of DRAS show more inter-cluster separation than those of the other two methods, demonstrating the disentangled feature learning ability of the identity agent. The t-SNE visualization result of IPCGAN, shown in Figure 15, presents several outliers, even for the same person (marked by dotted circles). This is because IPCGAN shares identity feature layers with the age classifier, making it difficult to disentangle the identity and age features. In the visualization result of CAAE, the identity features of different people are always clustered together, which is clearly undesirable. CAAE lacks a disentangling ability mainly because it only considers identity preservation at the pixel level rather than in the feature space.

Quantitative face verification between each test image and its synthesized images is carried out to check the identity preservation performance of the three methods. The quantitative measurements between the test identity reference image and its synthesized images are shown in Table 5. We also perform face verification among the synthesized images of the three methods, and the comparative results for the five age groups (grp. $n$ abbreviates age group $n$ ) are shown in Tables 6, 7 and 8. From Figure 12, we find that the synthesized images of IPCGAN look almost the same as their identity reference images, which is the reason for the highest average verification confidence of IPCGAN. However, it is not an ideal method because it lacks an aging effect. For DRAS, the verification confidences in Table 5 and Table 8 all surpass the threshold and exceed those of CAAE. As discussed in Section 4.3.2, our model improves the identity preservation ability while also rendering the age effect.

4.3.5 Age Preservation Comparison

In the last experiment, we compare the age preservation performance of the three methods. The quantitative comparison results are shown in Table 9 and Table 10. Our model obtains the highest accuracy for seven age groups; groups $0\#$ , $5\#$ and $7\#$ are the only exceptions, where its performance is slightly lower. According to the original data distribution in Figure 5, the amount of CACD data is largest in group $5\#$ , where it is nearly the same as that of UTKFace. To balance the data distribution, more CACD images in $5\#$ were produced by flipping and cropping. Since most CACD images show celebrities with heavy makeup or exaggerated expressions, it is difficult to extract the exact age, which results in the lower performance of our model in $5\#$ .
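The two augmentation operations mentioned above, flipping and cropping, can be illustrated on a toy image stored as a nested list. This is a sketch only: the paper does not state the crop sizes or offsets used, so the window below is an arbitrary assumption.

```python
def horizontal_flip(image):
    """Mirror an image (list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def crop(image, top, left, height, width):
    """Cut a height x width window whose top-left corner is at (top, left)."""
    if top < 0 or left < 0 or top + height > len(image) or left + width > len(image[0]):
        raise ValueError("crop window falls outside the image")
    return [row[left:left + width] for row in image[top:top + height]]

# A 3 x 3 toy image with distinct pixel values.
image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
flipped = horizontal_flip(image)
patch = crop(image, top=1, left=1, height=2, width=2)
```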

5 Conclusion

In this paper we studied a new age synthesis task, namely dual-reference age synthesis, and proposed a novel framework for it. The proposed framework takes two images as inputs, one of which provides the identity of the target image and the other its age. Instead of using a given number as “hard” age information, DRAS learns the “soft” age information from the age reference image without any age annotations. Compared to conventional age synthesis methods, the GAN-based DRAS is able to generate higher quality images with more details and a more natural appearance. Compared to other GAN-based methods, DRAS uses an image to describe the age information and does not need pair-wise data to train the model. Experimental results on UTKFace and CACD demonstrate that the proposed approach achieves promising results on this new task.

In this paper, we only consider the Euclidean metric for the age preservation loss and the identity preservation loss. A better choice may be to minimize the distance between the probability distributions of the learnt feature and the reference feature. In future work, we would like to investigate different probability divergence metrics as new loss functions.
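To make the contrast concrete, the sketch below compares a squared Euclidean distance between two feature vectors with one possible divergence between distributions derived from them. Everything beyond the Euclidean term is an assumption: the paper does not name a divergence, so the Kullback-Leibler divergence and the softmax normalization used to turn features into distributions are illustrative choices, as are the toy feature values.

```python
import math

def softmax(values):
    """Turn a feature vector into a probability distribution (assumed normalization)."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def squared_euclidean(a, b):
    """Squared Euclidean distance, the kind of metric used for the current losses."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy learnt and reference features (hypothetical values).
learnt = [2.0, 1.0, 0.1]
reference = [1.5, 1.2, 0.3]
p, q = softmax(learnt), softmax(reference)
euclid = squared_euclidean(learnt, reference)
divergence = kl_divergence(p, q)
```

Unlike the Euclidean distance, the Kullback-Leibler divergence is asymmetric, so the order of the learnt and reference distributions would be a design decision in any such loss.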

Acknowledgments

This work was supported by the Jiangsu Overseas Visiting Scholar Program for University Prominent Young and Middle-aged Teachers and Presidents.

Bibliography

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in neural information processing systems, 2014, pp. 2672–2680.
[2] K. Ricanek Jr, E. Boone, E. Patterson, Craniofacial aging impacts on the eigenface face biometric, Computer Science 1 (2006) 3.
[3] Y. Fu, G. Guo, T. S. Huang, Age synthesis and estimation via faces: A survey, IEEE transactions on pattern analysis and machine intelligence 32 (11) (2010) 1955–1976.
[4] L. S. Mark, J. B. Pittenger, H. Hines, C. Carello, R. E. Shaw, J. T. Todd, Wrinkling and head shape as coordinated sources of age-level information, Perception & Psychophysics 27 (2) (1980) 117–124.
[5] A. J. O’Toole, T. Vetter, H. Volz, E. M. Salter, Three-dimensional caricatures of human heads: distinctiveness and the perception of facial age, Perception 26 (6) (1997) 719–732.
[6] A. J. O’Toole, T. Price, T. Vetter, J. C. Bartlett, V. Blanz, 3d shape and 2d surface textures of human faces: the role of “averages” in attractiveness and age, Image and Vision Computing 18 (1) (1999) 9–19.
[7] J. Suo, S.-C. Zhu, S. Shan, X. Chen, A compositional and dynamic model for face aging, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (3) (2009) 385–401.
[8] Y. Tazoe, H. Gohara, A. Maejima, S. Morishima, Facial aging simulator considering geometry and patch-tiled texture, in: ACM SIGGRAPH 2012 Posters, 2012, pp. 1–1.