A Unit Enhancement and Guidance Framework for Audio-Driven Avatar Video Generation

S.Z. Zhou; Y.B. Wang; J.F. Wu; T. Hu; J.N. Zhang

arXiv:2505.03603 · cs.CV · June 13, 2025

Open Access

TL;DR

This paper introduces PAHA, a novel framework for audio-driven avatar video generation that enhances regional guidance and consistency, significantly improving quality and efficiency over existing multi-stage methods.

Contribution

The paper proposes a new unit enhancement and guidance framework with two key methods, PAR and PCE, to improve visual quality and motion consistency in audio-driven avatar videos.

Findings

01

PAHA outperforms existing methods in audio-motion alignment.

02

The proposed classifiers improve regional consistency.

03

Experimental results validate the effectiveness of the framework.

Abstract

Audio-driven human animation technology is widely used in human-computer interaction, and the emergence of diffusion models has further advanced its development. Currently, most methods rely on multi-stage generation and intermediate representations, resulting in long inference times and issues with generation quality in specific foreground regions and with audio-motion consistency. These shortcomings are primarily due to the lack of localized, fine-grained supervised guidance. To address the above challenges, we propose Parts-aware Audio-driven Human Animation (PAHA), a unit enhancement and guidance framework for audio-driven upper-body animation. We introduce two key methods: Parts-Aware Re-weighting (PAR) and Parts Consistency Enhancement (PCE). PAR dynamically adjusts regional training loss weights based on pose confidence scores, effectively improving visual quality. PCE constructs and trains…
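
The idea behind PAR, re-weighting a training loss per body region using pose confidence scores, can be illustrated with a small sketch. This is not the authors' implementation: the function name `parts_aware_loss`, the region masks, the `alpha` scale, and the weighting rule `1 + alpha * confidence` are all assumptions made for illustration, since the abstract does not specify how confidence maps to weight.

```python
from typing import Dict, List

Grid = List[List[float]]  # a tiny 2D "image" as nested lists


def parts_aware_loss(
    pred: Grid,
    target: Grid,
    masks: Dict[str, Grid],
    confidence: Dict[str, float],
    alpha: float = 1.0,
) -> float:
    """Weighted mean squared error over all pixels.

    Hypothetical weighting rule (an assumption, not taken from the paper):
    each pixel starts at weight 1 and gains alpha * confidence[region]
    for every region whose mask covers it.
    """
    total = 0.0
    weight_sum = 0.0
    for i, row in enumerate(pred):
        for j, p in enumerate(row):
            w = 1.0
            for name, mask in masks.items():
                w += alpha * confidence[name] * mask[i][j]
            total += w * (p - target[i][j]) ** 2
            weight_sum += w
    return total / weight_sum


# Toy example: one row, two pixels; only the first pixel lies in the "face" region.
pred = [[1.0, 0.0]]
target = [[0.0, 0.0]]
masks = {"face": [[1.0, 0.0]]}

low = parts_aware_loss(pred, target, masks, {"face": 0.0})
high = parts_aware_loss(pred, target, masks, {"face": 1.0})
print(low, high)
```

In this toy setup, raising the confidence for the face region increases the share of the loss attributed to the error inside that region. Whether confident regions should be weighted up or down is a design choice that this sketch does not settle.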

Taxonomy

Topics: Human Motion and Animation · Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis

Methods: Diffusion