ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer

Jiazhi Guan; Zhiliang Xu; Hang Zhou; Kaisiyuan Wang; Shengyi He; Zhanwang Zhang; Borong Liang; Haocheng Feng; Errui Ding; Jingtuo Liu; Jingdong Wang; Youjian Zhao; Ziwei Liu

arXiv:2408.03284·cs.CV·August 7, 2024

Open Access

TL;DR

ReSyncer is a unified framework that rewires style-based generators with a style-injected Transformer to produce high-fidelity, versatile lip-synced facial videos from audio, supporting fast personalization, style transfer, and face swapping.

Contribution

The paper introduces ReSyncer, a novel approach that reconfigures style-based generators with a Transformer for synchronized audio-visual facial generation, enabling multiple applications.

Findings

01. Produces high-fidelity lip-synced videos from audio.

02. Supports fast personalized fine-tuning and face swapping.

03. Enables transfer of speaking styles.

Abstract

Lip-syncing videos with given audio is the foundation for various applications including the creation of virtual presenters or performers. While recent studies explore high-fidelity lip-sync with different techniques, their task-orientated models either require long-term videos for clip-specific training or retain visible artifacts. In this paper, we propose a unified and effective framework, ReSyncer, that synchronizes generalized audio-visual facial information. The key design is revisiting and rewiring the Style-based generator to efficiently adopt 3D facial dynamics predicted by a principled style-injected Transformer. By simply re-configuring the information insertion mechanisms within the noise and style space, our framework fuses motion and appearance with unified training. Extensive experiments demonstrate that ReSyncer not only produces high-fidelity lip-synced videos according…
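The abstract refers to "style injection" without spelling out the mechanism. As generic background only, one common form of style injection in style-based generators is adaptive instance normalization (AdaIN), used in the original StyleGAN: each feature channel is normalized and then rescaled and shifted by style-derived parameters. The sketch below illustrates that idea in plain Python. It is not ReSyncer's architecture, and every name in it (`instance_normalize`, `style_modulate`, the toy inputs) is a hypothetical chosen for illustration.

```python
import math
from typing import List, Sequence


def instance_normalize(channel: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Normalize one feature channel to zero mean and near-unit variance."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((x - mean) ** 2 for x in channel) / n
    return [(x - mean) / math.sqrt(var + eps) for x in channel]


def style_modulate(
    features: Sequence[Sequence[float]],
    scales: Sequence[float],
    shifts: Sequence[float],
) -> List[List[float]]:
    """AdaIN-style injection (illustrative assumption, not the paper's method).

    Each channel is normalized, then multiplied by a per-channel scale and
    offset by a per-channel shift. In a real generator the scales and shifts
    would be predicted from a style vector; here they are passed in directly.
    """
    if not (len(features) == len(scales) == len(shifts)):
        raise ValueError("need one scale and one shift per channel")
    return [
        [s * v + b for v in instance_normalize(ch)]
        for ch, s, b in zip(features, scales, shifts)
    ]


# Hypothetical toy input: 2 channels with 4 positions each.
features = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 30.0, 30.0]]
scales = [2.0, 0.5]   # stand-ins for style-derived scales
shifts = [0.0, 1.0]   # stand-ins for style-derived shifts
out = style_modulate(features, scales, shifts)
```

After modulation, each channel's mean equals its shift and its spread is set by its scale, regardless of the statistics of the incoming features. That is the sense in which the style, rather than the content, controls the feature statistics.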

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis

Methods: Linear Layer · Residual Connection · Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections