Few-shots Portrait Generation with Style Enhancement and Identity Preservation

Runchuan Zhu; Naye Ji; Youbing Zhao; Fan Zhang

arXiv:2303.00377·cs.CV·March 2, 2023

TL;DR

This paper introduces StyleIdentityGAN, a few-shots portrait generation model that simultaneously enhances artistic style and preserves individual identity, requiring minimal reference data and outperforming existing methods.

Contribution

The paper presents a novel StyleIdentityGAN model that decouples style transfer and identity preservation, enabling high-quality portrait generation with few reference images.

Findings

01 Outperforms state-of-the-art methods in artistry and identity preservation

02 Requires only a small number of reference style images

03 Validated through qualitative, quantitative, and user studies

Tables

Table 1: User study of watercolor cartoon stylization on the CAS-WACO dataset

Method	Strategy	Preference Score (%) ↑
JoJoGAN	One-shot	3.00
JoJoGAN	Few-shot	14.00
Ours	One-shot	29.22
Ours	Few-shot	48.74

Table 2: FID and SSIM results

Dataset	Method	FID ↓	SSIM ↑
CAS-WACO	JoJoGAN	272.94	0.42
CAS-WACO	Ours	210.98	0.40
APDrawing	APDrawingGAN	217.86	0.43
APDrawing	U2Net	208.74	0.51
APDrawing	Ours	221.38	0.46

Equations

$$\omega_{s}=\omega_{s}^{un}+\alpha\cdot\omega_{s}^{re}+(1-\alpha)\cdot\omega_{rand}^{re}$$

$$\mathcal{L}=\mathcal{L}_{ref}+\lambda_{feature}\cdot\mathcal{L}_{feature}$$

Code & Models

Repositories

zrc007/styleidentitygan (PyTorch, official)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · 3D Shape Modeling and Analysis

Full text

1 College of Media Engineering, Communication University of Zhejiang, Hangzhou, China
2 Key Lab of Film and TV Media Technology of Zhejiang Province, Communication University of Zhejiang, Hangzhou, China

Email: [email protected]; {jinaye, zyb, fanzhang}@cuz.edu.cn

Runchuan Zhu (2)
Naye Ji (✉) (1, 2), ORCID 0000-0002-6986-3766
Youbing Zhao (1, 2)
Fan Zhang (1, 2), ORCID 0000-0002-9534-1777

Abstract

Nowadays, the wide application of virtual digital humans promotes the comprehensive prosperity and development of digital culture supported by the digital economy. A personalized portrait automatically generated by AI technology needs both a natural artistic style and human sentiment. In this paper, we propose a novel StyleIdentityGAN model, which ensures the identity and artistry of the generated portrait at the same time. Specifically, the style-enhanced module focuses on decoupling and transferring artistic style features to improve the artistry of generated virtual face images. Meanwhile, the identity-enhanced module preserves the significant features extracted from the input photo. Furthermore, the proposed method requires only a small number of reference style data. Experiments demonstrate the superiority of StyleIdentityGAN over state-of-the-art methods in artistry and identity effects, with comparisons done qualitatively, quantitatively and through a perceptual user study. Code has been released on GitHub: https://github.com/Zrc007/StyleIdentityGAN.

Keywords: Image stylization · Style transfer · Few-shots portrait generation · Virtual avatar

1 Introduction

Stylization is the process of converting the style of an input image into that of a given style reference sample. During the conversion, the detailed features of the original input image should be kept as much as possible, while the lines and textures of the reference are imitated. With the development of AI technology, image stylization can be widely used in video conferencing, entertainment and other fields of daily life. Given the massive demand for digital content in the currently popular metaverse, the quality requirements for automatic image stylization are becoming increasingly high.

Stylization has a long research history and can be divided into two categories: image-based style transfer and face-specific style transfer. Image-based style transfer renders images in different styles while retaining the original image content as much as possible [9]. However, traditional style transfer methods can only extract low-level image features (color, texture, etc.) to synthesize textures and cannot capture high-level features. In recent years, deep learning has developed rapidly and has been widely applied in many fields. Deep neural networks have a strong feature extraction ability and can obtain rich semantic information. Thanks to this, style transfer based on deep learning has made considerable progress.

Image-based style transfer is simpler than face-specific style transfer, because the latter needs to pay more attention to the refinement of detail features and the integration of style. In the literature, face-specific styles commonly include sketch stylization, cartoon stylization and oil painting stylization. In sketch stylization, facial feature information is mainly retained, while irrelevant colors are removed or converted into gray values. In cartoon and oil painting stylization, since the result differs considerably from the original portrait, we need to preserve the original facial feature information while learning the color and texture of the style. Moreover, data scarcity is one of the challenges for stylization, since many stylization algorithms depend heavily on their dataset. When we encounter interesting style references that are not covered by the dataset, new stylization methods are needed.

In this paper, we propose StyleIdentityGAN to automatically generate portraits of a specified artistic style. The proposed model ensures the identity and artistry of the generated portrait at the same time and requires only a few reference style samples. First, the style-enhanced module decouples artistic style features, such as shading tone, color and texture, from the reference style samples. Based on the decoupled artistic style features in the corresponding latent space, the artistic style can be transferred to the required results, which improves the artistry of the generated portraits. Once the artistry is guaranteed, our identity-enhanced module seeks to preserve the identity information of the original face. We introduce a feature loss function that preserves the significant features of the input photo to guarantee identity characteristics. We conduct our experiments on 4 datasets covering cartoon and sketch styles. The qualitative and quantitative evaluations show that the portraits generated by StyleIdentityGAN exhibit good aesthetics and resemble the original face. Our contributions can be summarized as follows:

(1) We decouple the stylized texture features from the stylized image domain to obtain the art style independent features. The decoupled style features can be obtained through a transformation of the specified style latent space to enhance artistry.

(2) The introduced feature loss emphasizes the importance of facial features, which addresses the problem that the output tends to lose the facial features of the input image, while still ensuring the stylized effect.

(3) Our few-shot strategy does not rely on large-scale datasets and needs only a few pictures to start training. Moreover, the training is simple and fast: generating a group of high-quality results takes only a few minutes.

2 Related Work

2.1 Image-based Style Transfer

Image-based style transfer can be divided into two groups. (1) Image-based iteration: Gatys et al. proposed the original neural style transfer algorithm [3], which uses a convolutional neural network to extract features, followed by texture synthesis, calculation of a content loss and a style loss, and gradient descent on the total loss, iteratively updating the image to generate an artistic image. Afterwards, Liao et al. proposed visual attribute transfer [14] by combining deep VGG19 features with PatchMatch and achieved good results. (2) Model-based iteration: one branch is based on back-propagation through a stylization model and mainly optimizes the model architecture; the other is based on GANs (generative adversarial networks). Johnson et al. first proposed real-time style transfer [11] based on an iteratively optimized generation model, which uses a perceptual loss function to train the generation model for a specific style, providing a good idea for improving the efficiency of image style transfer. Zhu et al. proposed CycleGAN [24], which eliminates the need for paired images when converting images between different domains and solves the problem of paired data being difficult to collect. Moreover, UNIT [15], which combines GAN and VAE, is suited to the mutual conversion between real images, and StarGAN realizes conversion among multiple domains with a single model.

2.2 Face-specific Style Transfer

Face-specific style transfer methods based on GANs have emerged in large numbers, driven by progress in image-based style transfer and the gradual enrichment of face stylization datasets. For example, on the APDrawing line-drawing portrait dataset, there are APDrawingGAN [22], which integrates global and local generators, the asymmetric cycle-structure GAN, and U2-Net [17], which generates portraits with plenty of detail. On the WebCaricature dataset [6], CariGANs [1], a network integrating CariGeoGAN and CariStyGAN, and Semantic-CariGANs, which learns caricatures through semantic shape transformation, have been proposed.

In recent years, more and more work on cartoon-style face transfer has emerged. Representative methods include white-box cartoon representation combining VGG and GAN, and U-GAT-IT [13], which introduces an attention mechanism and adopts AdaLIN. Based on U-GAT-IT, Zhuang and Yang [25] use Soft-AdaLIN to transfer information on unpaired data and generate more refined animation faces. In addition, based on the pre-trained StyleGAN2 model [12], AgileGAN [18] fine-tunes on a large number of samples to generate comic and other styles, with improved artistry compared with previous methods. However, because the available datasets are small, current stylization results still have room for improvement. To address this problem, some one-shot methods have been proposed, such as DiFa [23] and JoJoGAN [2].

2.3 Face Editing

Face editing aims to manipulate certain attributes of a face image to generate a series of new faces with the required attributes while preserving other details. Face editing is always based on face generation. Although controlling faces through facial key points can also realize face attribute editing, it often leads to face distortion and asymmetry. With the development of generative models, face editing has made substantial progress. GANs realize the mapping of images from one domain to another, which has led to many high-quality papers on face editing. For example, AttGAN [4] introduces an attribute classification constraint, which protects non-edited attributes while ensuring the high quality of the generated image. SGGAN [21] proposes a novel piece-wise guided Generative Adversarial Network, which uses semantic segmentation to further improve generation performance and provide spatial mapping. SC-FEGAN [10] goes a step further and realizes free-form face editing through a network that jointly takes the original picture, sketches, mask images, color images and noise as input. Face editing realized in this way is no longer limited to attribute tags, but hands the freedom of face editing over to users, using this joint input together with SN-PatchGAN [23]. PSGAN [7] proposes the MDNet and AMM modules to transfer the makeup of any reference image to a source image without makeup. Jin et al. [8] proposed a method based on aesthetics-driven reinforcement learning.

3 StyleIdentityGAN

3.1 Problem Definition

Suppose that $\mathcal{P}$ is the face image domain and $\mathcal{S}$ is the stylized image domain. For any input face image $p\in\mathcal{P}$, the corresponding stylized result is $\tilde{p}\cong s,s\in\mathcal{S}$, where $s$ is the specified artistic style sample image to be migrated. The goal of stylized portrait generation, which takes into account the complex structure and artistry of faces, is to build a mapping $G$ from the face image domain to the stylized image domain. Consequently, the generated $\tilde{p}$ still retains the identity information of the original human face $p$, that is, the structural and semantic features of the original face remain unchanged, but the style is as close as possible to the given sample $s$. The overview of our StyleIdentityGAN model is shown in Fig. 1. The style-enhanced module and identity-enhanced module are described in Sec. 3.2 and Sec. 3.3, respectively.

3.2 Style-enhanced Module

We first decompose the stylized texture features from the stylized image domain $\mathcal{S}$. Suppose the feature of the designated artistic reference is $\omega_{s}$; the style-enhanced module then decouples it into artistic style related features and artistic style independent features according to the stylization domain $\mathcal{S}$. Define the art style related feature as $\omega^{re}_{s}$ and the art style independent feature as $\omega^{un}_{s}$. After applying the decoupling operation $F$ to $\omega_{s}$, the art style related feature $\omega^{re}_{s}$ can be obtained: $\omega^{re}_{s}=\omega_{s}-\omega^{un}_{s}$. Because it depends on the style domain $\mathcal{S}$, the $\omega^{re}_{s}$ of the specified style can be obtained from the domain transformation $T$. The style features enhanced by artistic features are obtained through this transformation.

We use a few-shot strategy to reduce the impact of a single reference on the results, in contrast with the one-shot strategy of JoJoGAN [2]. Compared with a single reference, a few-shot setting more easily suppresses the influence of reference-specific features on the results. Given an input face photo $p$ and a small group of reference sample images $s_{1},...,s_{n}$, we derive the corresponding latent code $\omega$ by applying StyleGAN inversion to $p$ and $s_{i}$, respectively. The latent space $\omega$ encodes the content code of the main semantics in the image. Next, we randomly generate a tensor $\omega^{re}_{rand}$ with the same size as $\omega^{re}_{s}$, and then the new $\omega_{s}$ is:

$$\omega_{s}=\omega^{un}_{s}+\alpha\cdot\omega^{re}_{s}+(1-\alpha)\cdot\omega^{re}_{rand}\qquad(1)$$

According to Eq. 1, we can generate new images $S_{1}^{\prime},...,S_{n}^{\prime}$ and $p^{\prime}$ by passing the new $\omega$ through a network $G$, which is initially StyleGAN. These images still retain the facial features of the original image but change the image style, as shown in Figure. Finally, our goal is to obtain a network that changes only the style without changing the facial features, so we loop steps 2 and 3 to train the network $G$.
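As a quick sanity check on the mixing rule of Eq. 1, the limiting values of $\alpha$ can be worked out directly from the decoupling $\omega^{re}_{s}=\omega_{s}-\omega^{un}_{s}$. The cases below are a derivation from the stated definitions, not an additional result from the paper; the left-hand symbol $\omega_{s}^{new}$ is notation introduced here to distinguish the mixed code from the original $\omega_{s}$.

```latex
\begin{aligned}
\text{general:}&\quad \omega_{s}^{new} = \omega_{s}^{un} + \alpha\,\omega_{s}^{re} + (1-\alpha)\,\omega_{rand}^{re}\\
\alpha = 1:&\quad \omega_{s}^{new} = \omega_{s}^{un} + \omega_{s}^{re} = \omega_{s} && \text{(the reference code is unchanged)}\\
\alpha = 0:&\quad \omega_{s}^{new} = \omega_{s}^{un} + \omega_{rand}^{re} && \text{(the style-related part is fully replaced)}\\
\alpha = 0.5:&\quad \omega_{s}^{new} = \omega_{s}^{un} + \tfrac{1}{2}\left(\omega_{s}^{re} + \omega_{rand}^{re}\right) && \text{(equal blend)}
\end{aligned}
```

A smaller $\alpha$ therefore injects more randomness into the style-related components while leaving $\omega^{un}_{s}$ untouched.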

3.3 Identity-enhanced Module

We notice that during training, the skin color and facial features of the generated image gradually become close to those of the reference, so we take measures to suppress this drift. Our approach is to add a feature loss, which is the Learned Perceptual Image Patch Similarity (LPIPS) between the currently generated image and the input. We denote it as $\mathcal{L}_{feature}$, and the total loss function is:

$$\mathcal{L}=\mathcal{L}_{ref}+\lambda_{feature}\cdot\mathcal{L}_{feature}\qquad(2)$$
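To make the weighting concrete, the snippet below evaluates the total loss $\mathcal{L}=\mathcal{L}_{ref}+\lambda_{feature}\cdot\mathcal{L}_{feature}$ for placeholder loss values. It is only a minimal sketch: the function name `total_loss` and the numeric inputs are hypothetical, and in practice both loss terms would be computed from the network output and an LPIPS model rather than supplied as constants.

```python
def total_loss(loss_ref, loss_feature, lambda_feature):
    """Weighted sum of the reference (style) loss and the identity feature loss."""
    return loss_ref + lambda_feature * loss_feature


if __name__ == "__main__":
    # Placeholder values chosen for illustration only (not from the paper).
    loss_ref, loss_feature = 0.8, 0.4
    # Candidate weights inside the range explored in the experiments.
    for lam in (0.0005, 0.001, 0.002, 0.003):
        print(lam, total_loss(loss_ref, loss_feature, lam))
```

A larger $\lambda_{feature}$ puts more weight on staying close to the input photo, at the cost of a weaker pull toward the reference style.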

4 Experiments

4.1 Datasets

We conduct experiments on 4 datasets covering sketch stylization and cartoon stylization. For shading-style sketch stylization, we use CUFS [19], which includes 606 photo-sketch pairs and is used to study face sketch synthesis and face sketch recognition. Another dataset is the line-style sketch dataset APDrawing [22], which collects 140 pairs of facial photos and corresponding portraits. For cartoon stylization, we use CAS-WACO, a dataset collected by us containing 50 face images with corresponding watercolor cartoon portraits drawn by professional artists, as well as a cartoon dataset of 317 cartoon face images from Toonify [16].

4.2 Implementation details

During training, we generate $\omega^{re}_{rand}$ and set its mixing hyperparameter $\alpha$ to 0.5. We adjust $\lambda_{feature}$ within the range $[0.0005,0.003]$, because a large number of experiments showed that $\lambda_{feature}$ gives the best generation results in this range. Experience shows that some components of $\omega_{s}$ have a greater impact on style, so we define an array swap_list = [7, 9, 11, 15, 16, 17]: when generating $\omega^{re}_{rand}$, we randomly generate only the components indexed by swap_list and leave the other components unchanged. For sketch stylization we train for $150$ epochs, while for cartoon stylization we train for $500$ epochs. The number of epochs for sketch stylization is smaller because we found that the loss has converged and the images look good at about $100$ epochs, and further iterations do not significantly improve the result. We used a Tesla P40 (24GB) GPU for training; each reference takes about 30 seconds for sketch stylization and about $2$ minutes for cartoon stylization.
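The restricted random mixing described above can be sketched in plain Python. This is one possible reading, not the released implementation: it assumes the latent code is a list of per-layer style vectors (18 layers here, with a tiny vector size for the demo), treats the layers in swap_list as the style-related part and all other layers as the style-independent part, and draws the random code from a standard Gaussian. The function name `mix_style_code` and the Gaussian sampling are hypothetical choices made for illustration.

```python
import random

NUM_LAYERS = 18                      # assumed number of style vectors in the latent code
DIM = 4                              # tiny vector size for the demo (assumption)
SWAP_LIST = [7, 9, 11, 15, 16, 17]   # layer indices given in the text
ALPHA = 0.5                          # mixing weight given in the text


def mix_style_code(w_s, alpha=ALPHA, swap_list=SWAP_LIST, rng=None):
    """Blend only the swap_list layers of w_s with a random code.

    Layers outside swap_list are copied unchanged (style-independent part).
    Layers inside swap_list become alpha * original + (1 - alpha) * random.
    """
    rng = rng or random.Random()
    mixed = [row[:] for row in w_s]  # copy so the input is not modified
    for layer in swap_list:
        for k, value in enumerate(w_s[layer]):
            rand_value = rng.gauss(0.0, 1.0)  # assumed sampling of the random code
            mixed[layer][k] = alpha * value + (1 - alpha) * rand_value
    return mixed


if __name__ == "__main__":
    w = [[float(i)] * DIM for i in range(NUM_LAYERS)]
    out = mix_style_code(w, rng=random.Random(0))
    changed = [i for i in range(NUM_LAYERS) if out[i] != w[i]]
    print("layers changed:", changed)
```

Keeping the untouched layers identical is what lets the mixed code retain coarse facial structure while only the style-heavy layers are perturbed.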

4.3 Ablation Study

In the ablation study, we use the CAS-WACO dataset. Through a large number of experiments, we found that $\lambda_{feature}$ works better within $[0.0005,0.003]$. Therefore, we selected $\lambda_{feature}$ values of $0.0005$, $0.001$ and $0.002$; Fig. 2 shows the results of this experiment. We found that when $\lambda_{feature}$ is $0.001$, the result best meets our requirements. Therefore, we set $\lambda_{feature}$ to $0.001$ in the following experiments. Fig. 3 shows the impact of the number of references on the results. We selected 1, 2, 7, and 14 style maps respectively, where the sets of 1, 2, and 7 were the same style maps as the input map, and the set of 14 consisted of 7 men and 7 women. From the experimental results, we found that the few-shot strategy generalizes better than one-shot, and that one-shot results are strongly affected by the facial features of the style map, which is not desired.

4.4 Comparable Experiments

4.4.1 Qualitative Results.

We compare our method with JoJoGAN [2] in watercolor cartoon style, and APDrawingGAN [22], U2Net [17] in line sketch style. Due to convergence difficulties and GPU memory limitations, those methods were not able to directly support 1024×1024 resolution, thus we kept their original sizes of 256×256 or 512×512 for training and up-sampled the output to 1024×1024 for comparison. From Fig. 4, it can be seen that our method successfully cartoonized subjects with visually pleasing results.

4.4.2 User Study.

We further quantify the improvement in stylization quality across methods and parameters through human evaluation. We conducted a perceptual user study in which 28 participants were shown stylization results from different methods and different parameters. Among them, 11 participants are good at portrait drawing, while 17 are not skilled in art drawing. They were asked to select the best cartoonized images. Each participant was first shown 11 photo-portrait pairs in watercolor cartoon style from different methods. Table 1 shows that results from our proposed method were preferred by the majority. Besides, the few-shot strategy outperforms the one-shot strategy for both JoJoGAN and our StyleIdentityGAN.

Then each participant was shown 11 photo-portrait pairs in Toonify cartoon style from different strategies. The human evaluation results are shown in Fig. 6, which illustrates that portraits generated by the one-shot strategy of either style received fewer votes than portraits generated by the few-shot strategy. It can be seen that our few-shot strategy significantly improves over the one-shot strategy for cartoon generation on the Toonify dataset.

4.4.3 Quantification Evaluation.

We choose the Fréchet Inception Distance (FID) score [5] to quantitatively evaluate the stylization results. FID measures the distance between the feature distributions of two image sets, which reflects their visual similarity. Each method generated stylized images using the CelebA-HQ dataset as input, and we computed FID against the training cartoon dataset. Table 2 shows that our method also performs better on this metric than JoJoGAN [2] in watercolor cartoon style. It should be noted, however, that since there are fewer than 5K images in the CelebA-HQ test set, the FID scores may not be very reliable. In line sketch style, our results are roughly comparable to those of APDrawingGAN [22] and U2Net [17], while we need only a few reference images and a short training time.
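For reference, FID as defined in [5] compares Gaussian fits to Inception features of the two image sets. The symbols $\mu_{r},\Sigma_{r}$ (mean and covariance of the reference set features) and $\mu_{g},\Sigma_{g}$ (mean and covariance of the generated set features) are notation introduced here, not taken from the text above.

```latex
\mathrm{FID} = \lVert \mu_{r} - \mu_{g} \rVert_{2}^{2}
  + \mathrm{Tr}\!\left(\Sigma_{r} + \Sigma_{g} - 2\left(\Sigma_{r}\Sigma_{g}\right)^{1/2}\right)
```

Because the covariance matrices are estimated in a high-dimensional feature space, small image sets make the estimates noisy, which is consistent with the caveat about reliability noted above.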

Another metric to quantitatively evaluate generative quality is the Structural Similarity (SSIM) measure [20]. SSIM is a widely used metric which computes structure similarity, luminance and contrast comparison using a sliding window on the local patch.
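The per-patch SSIM computation can be sketched as follows. This is a simplified illustration under stated assumptions: it scores a single flattened grayscale patch with uniform weights, whereas full SSIM averages such scores over sliding windows (often Gaussian-weighted). The function name `ssim_patch` is hypothetical; the constants follow the commonly used defaults $K_1=0.01$, $K_2=0.03$ with an 8-bit dynamic range.

```python
def ssim_patch(x, y, data_range=255.0, k1=0.01, k2=0.03):
    """SSIM between two equally sized patches given as flat lists of pixel values."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("patches must have the same length (at least 2 pixels)")
    c1 = (k1 * data_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * data_range) ** 2  # stabilizes the contrast/structure term
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


if __name__ == "__main__":
    patch = [10.0, 20.0, 30.0, 40.0]
    print(ssim_patch(patch, patch))        # identical patches score (approximately) 1
    print(ssim_patch(patch, patch[::-1]))  # reversed structure scores lower
```

Identical patches give a score of 1 up to floating-point rounding, and the score drops as luminance, contrast, or structure diverge.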

4.5 More Results and Discussion

Fig. 8 shows some results of our method on the CUFS sketch dataset. Fig. 8 shows results of our method using 4 reference style samples on the Toonify cartoon dataset. The varied styles in the Toonify cartoon dataset cause artifacts on some local face components, e.g. ghosting on noses, when the one-shot strategy is used. More comparison results are attached in the supplemental material.

From the above results and the qualitative and quantitative evaluations, we can see the advantages of few-shot over one-shot. The results obtained by few-shot are more natural and do not depend heavily on the reference. In sketch stylization, our method achieves evaluation results comparable to APDrawingGAN and U2Net, while requiring only a few references and a very short training time, which is our advantage. In cartoon stylization, compared with JoJoGAN, our results are better in both quantitative evaluation and subjective evaluation, and we retain more facial features.

5 Conclusion and Future Work

In this paper, we propose the StyleIdentityGAN model, which consists of a style-enhanced module and an identity-enhanced module. The former addresses the over-dependence of the generated results on a single reference and improves the artistic perception using a few reference style samples, while the latter reasonably retains the facial features of the input. In future work, we will focus on optimizing facial details, e.g. hair naturalness, and on adding an attention mechanism or normalizing flows to the style-enhanced module. Building on the preliminary improvement of artistic effect, emotional semantic features can be added to the structural semantic space in the generation process, and the facial structural features can be fine-tuned, which makes it possible to edit and change the expression of the generated virtual face, making the generated virtual person more emotional and vivid. Furthermore, a model combining implicit emotion and pose features can be built to generate video-driven virtual facial animations.

Bibliography

[1] Cao, K., Liao, J., Yuan, L.: CariGANs: Unpaired photo-to-caricature translation. ACM Transactions on Graphics (TOG) 37 (6), 244.1–244.14 (2018)
[2] Chong, M.J., Forsyth, D.: JoJoGAN: One shot face stylization. arXiv preprint arXiv:2112.11641 (2021)
[3] Gatys, L., Ecker, A., Bethge, M.: A neural algorithm of artistic style. Journal of Vision 16 (12), 326–326 (2016), https://doi.org/10.1167/16.12.326
[4] He, Z., Zuo, W., Kan, M., Shan, S., Chen, X.: AttGAN: Facial attribute editing by only changing what you want. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 28 (11), 5464–5478 (2019)
[5] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. pp. 6629–6640. NIPS'17, Curran Associates Inc., Red Hook, NY, USA (2017)
[6] Huo, J., Li, W., Shi, Y., Gao, Y., Yin, H.: WebCaricature: a benchmark for caricature recognition. In: British Machine Vision Conference (BMVC) (2017)
[7] Jiang, W., Liu, S., Gao, C., Cao, J., He, R., Feng, J., Yan, S.: PSGAN: Pose and expression robust spatial-aware GAN for customizable makeup transfer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5194–5202 (2020)
[8] Jin, X., Zhao, S., Zhang, L., Zhao, X., Deng, Q., Xiao, C.: Attribute controllable beautiful Caucasian face generation by aesthetics driven reinforcement learning. In: ACM Multimedia, Technical Demos and Videos Program (2022)