GaussianTalker: Real-Time High-Fidelity Talking Head Synthesis with Audio-Driven 3D Gaussian Splatting

Kyusun Cho; Joungbin Lee; Heeji Yoon; Yeobin Hong; Jaehoon Ko; Sangjun Ahn; Seungryong Kim

arXiv:2404.16012 · cs.CV · April 26, 2024 · 1 citation

TL;DR

GaussianTalker is a real-time talking head synthesis framework that deforms a 3D Gaussian Splatting head representation in sync with speech audio. It improves on previous methods in facial fidelity and lip synchronization while rendering at up to 120 FPS.

Contribution

The paper presents a method that combines 3D Gaussian Splatting with audio features for stable, real-time, pose-controllable talking head generation.

Findings

01 Achieves up to 120 FPS rendering speed.

02 Outperforms previous methods in facial fidelity and lip synchronization.

03 Uses a shared implicit feature representation for Gaussian attributes.

Abstract

We propose GaussianTalker, a novel framework for real-time generation of pose-controllable talking heads. It leverages the fast rendering capabilities of 3D Gaussian Splatting (3DGS) while addressing the challenges of directly controlling 3DGS with speech audio. GaussianTalker constructs a canonical 3DGS representation of the head and deforms it in sync with the audio. A key insight is to encode the 3D Gaussian attributes into a shared implicit feature representation, where it is merged with audio features to manipulate each Gaussian attribute. This design exploits the spatial-aware features and enforces interactions between neighboring points. The feature embeddings are then fed to a spatial-audio attention module, which predicts frame-wise offsets for the attributes of each Gaussian. It is more stable than previous concatenation or multiplication approaches for manipulating the…
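
The deformation step described in the abstract can be sketched roughly as cross-attention from per-Gaussian features to audio features. The following is a minimal NumPy illustration, not the authors' implementation: the function name `spatial_audio_offsets`, the weight matrices, the feature sizes, and the 11-value attribute layout (position, scale, rotation, opacity) are all hypothetical, and random arrays stand in for the learned shared feature representation and for the audio encoder.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_audio_offsets(gauss_feats, audio_feats, w_q, w_k, w_v, w_out):
    """Hypothetical single-head cross-attention sketch.

    gauss_feats: (N, D)   one feature vector per Gaussian
    audio_feats: (T, D_a) audio feature tokens for the current frame
    Returns per-Gaussian attribute offsets (N, A) and attention weights (N, T).
    """
    q = gauss_feats @ w_q                      # (N, H) queries from Gaussians
    k = audio_feats @ w_k                      # (T, H) keys from audio
    v = audio_feats @ w_v                      # (T, H) values from audio
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, T)
    fused = attn @ v                           # (N, H) audio-aware features
    return fused @ w_out, attn                 # offsets: (N, A)


# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
N, T, D, D_a, H, A = 5, 4, 8, 6, 16, 11       # A: 3 pos + 3 scale + 4 rot + 1 opacity

canonical = rng.normal(size=(N, A))            # stand-in for canonical attributes
gauss_feats = rng.normal(size=(N, D))          # stand-in for shared features
audio_feats = rng.normal(size=(T, D_a))        # stand-in for audio features

w_q = rng.normal(size=(D, H))
w_k = rng.normal(size=(D_a, H))
w_v = rng.normal(size=(D_a, H))
w_out = rng.normal(size=(H, A)) * 0.01         # small offsets around canonical

offsets, attn = spatial_audio_offsets(gauss_feats, audio_feats, w_q, w_k, w_v, w_out)
deformed = canonical + offsets                 # frame-wise deformed attributes
print(deformed.shape, attn.shape)
```

In this sketch the canonical attributes stay fixed and only the per-frame offsets change with the audio, which mirrors the canonical-plus-deformation structure the abstract describes. Each row of `attn` sums to one, so every Gaussian receives a weighted mix of the audio tokens.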

Code & Models

Repositories

ku-cvlab/gaussiantalker (PyTorch, official)

Taxonomy

Topics: Tactile and Sensory Interactions · Robotics and Automated Systems · Hand Gesture Recognition Systems

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings