Sonic: Shifting Focus to Global Audio Perception in Portrait Animation

Xiaozhong Ji; Xiaobin Hu; Zhihong Xu; Junwei Zhu; Chuming Lin; Qingdong He; Jiangning Zhang; Donghao Luo; Yi Chen; Qin Lin; Qinglin Lu; Chengjie Wang

arXiv:2411.16331 · cs.MM · June 6, 2025


TL;DR

Sonic introduces a novel audio-driven approach for portrait animation that emphasizes global audio perception, disentangling intra- and inter-clip audio knowledge to improve naturalness, consistency, and lip synchronization in generated videos.

Contribution

The paper proposes a new paradigm, Sonic, focusing on global audio perception and disentangling intra- and inter-clip audio features for enhanced portrait animation.

Findings

01. Outperforms state-of-the-art methods in video quality and temporal consistency.

02. Achieves superior lip synchronization accuracy.

03. Enhances motion diversity in generated animations.

Abstract

The study of talking face generation mainly explores the intricacies of synchronizing facial movements and crafting visually appealing, temporally coherent animations. However, due to the limited exploration of global audio perception, current approaches predominantly employ auxiliary visual and spatial knowledge to stabilize the movements, which often degrades naturalness and introduces temporal inconsistencies. Considering the essence of audio-driven animation, the audio signal serves as the ideal and unique prior for adjusting facial expressions and lip movements, without resorting to the interference of any visual signals. Based on this motivation, we propose a novel paradigm, dubbed Sonic, to shift focus to the exploration of global audio perception. To effectively leverage global audio knowledge, we disentangle it into intra- and inter-clip audio perception and…

