ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars
Ziqiao Peng, Yi Chen, Yifeng Ma, Guozhen Zhang, Zhiyao Sun, Zixiang Zhou, Youliang Zhang, Zhengguang Zhou, Zhaoxin Fan, Hongyan Liu, Yuan Zhou, Qinglin Lu, Jun He

TL;DR
ActAvatar is a novel framework that achieves precise, temporally-aligned action control in talking avatars using text guidance, phase-aware attention, and hierarchical audio-visual alignment, outperforming existing methods.
Contribution
It introduces phase-aware cross-attention, hierarchical audio-visual alignment, and a two-stage training strategy for improved action control and lip synchronization.
Findings
Outperforms state-of-the-art in action control accuracy
Enhances lip synchronization quality
Maintains strong text-following capabilities
Abstract
Despite significant advances in talking avatar generation, existing methods face critical challenges: insufficient text-following capability for diverse actions, lack of temporal alignment between actions and audio content, and dependency on additional control signals such as pose skeletons. We present ActAvatar, a framework that achieves phase-level precision in action control through textual guidance by capturing both action semantics and temporal context. Our approach introduces three core innovations: (1) Phase-Aware Cross-Attention (PACA), which decomposes prompts into a global base block and temporally-anchored phase blocks, enabling the model to concentrate on phase-relevant tokens for precise temporal-semantic alignment; (2) Progressive Audio-Visual Alignment, which aligns modality influence with the hierarchical feature learning process-early layers prioritize text for…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Human Motion and Animation · Generative Adversarial Networks and Image Synthesis
