Video2StyleGAN: Encoding Video in Latent Space for Manipulation
Jiyang Yu, Jingen Liu, Jing Huang, Wei Zhang, Tao Mei

TL;DR
This paper introduces a method for encoding face videos into StyleGAN's latent space that preserves temporal consistency and captures detailed facial motion, enabling real-time face video manipulation.
Contribution
The paper presents a vision-transformer-based network that encodes face videos into StyleGAN's latent space with temporal consistency and detailed motion encoding. The authors report that it outperforms existing single-image encoding methods when applied to video.
Findings
Achieves real-time processing at 66 fps.
Outperforms existing methods in face video manipulation.
Enables pose and expression control in 3D space.
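As a rough sanity check on the real-time claim, the reported frame rate can be converted into a per-frame time budget. This is plain arithmetic; the helper name is illustrative and does not come from the paper.

```python
def frame_budget_ms(fps: float) -> float:
    """Return the average time available per frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


budget = frame_budget_ms(66)  # throughput reported in the findings
print(f"{budget:.1f} ms per frame")  # prints "15.2 ms per frame"
```

At 66 fps the encoder has roughly 15 ms per frame, which is under half of the approximately 33 ms available per frame for 30 fps video.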
Abstract
Many recent works have been proposed for face image editing by leveraging the latent space of pretrained GANs. However, few attempts have been made to directly apply them to videos, because 1) they do not guarantee temporal consistency, 2) their application is limited by their processing speed on videos, and 3) they cannot accurately encode details of face motion and expression. To this end, we propose a novel network to encode face videos into the latent space of StyleGAN for semantic face video manipulation. Based on the vision transformer, our network reuses the high-resolution portion of the latent vector to enforce temporal consistency. To capture subtle face motions and expressions, we design novel losses that involve sparse facial landmarks and dense 3D face mesh. We have thoroughly evaluated our approach and successfully demonstrated its application to various face video…
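The idea of reusing the high-resolution portion of the latent vector can be illustrated with a minimal sketch. It assumes the common W+ layout of a 1024x1024 StyleGAN generator (18 layers of 512 dimensions); the split point `COARSE_LAYERS`, the function name, and the random stand-in codes are hypothetical choices for illustration, not details taken from the paper.

```python
import numpy as np

NUM_LAYERS = 18    # W+ layers of a 1024x1024 StyleGAN generator
LATENT_DIM = 512   # dimension of each layer's style code
COARSE_LAYERS = 8  # hypothetical split point, not taken from the paper


def assemble_video_latents(per_frame_coarse, shared_fine):
    """Join per-frame low-resolution codes with one shared high-resolution code.

    per_frame_coarse: (num_frames, COARSE_LAYERS, LATENT_DIM)
    shared_fine:      (NUM_LAYERS - COARSE_LAYERS, LATENT_DIM)
    returns:          (num_frames, NUM_LAYERS, LATENT_DIM)
    """
    num_frames = per_frame_coarse.shape[0]
    # Repeat the same high-resolution layers for every frame, so the
    # fine appearance details cannot drift from frame to frame.
    fine = np.broadcast_to(shared_fine, (num_frames,) + shared_fine.shape)
    return np.concatenate([per_frame_coarse, fine], axis=1)


# Random stand-ins for encoder outputs (an assumption; a real encoder
# would predict these codes from the video frames).
rng = np.random.default_rng(0)
coarse = rng.normal(size=(4, COARSE_LAYERS, LATENT_DIM))
fine = rng.normal(size=(NUM_LAYERS - COARSE_LAYERS, LATENT_DIM))
latents = assemble_video_latents(coarse, fine)
print(latents.shape)
```

In this reading, only the low-resolution layers vary per frame to carry pose and expression, while the shared high-resolution layers keep fine appearance identical across the clip. How the actual network chooses and predicts the shared portion is not specified in the abstract.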
Taxonomy
Topics: Face recognition and analysis · Facial Nerve Paralysis Treatment and Research · Generative Adversarial Networks and Image Synthesis
Methods: StyleGAN · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Dense Connections · Convolution · R1 Regularization · Feedforward Network · Adaptive Instance Normalization
