ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars

Ziqiao Peng; Yi Chen; Yifeng Ma; Guozhen Zhang; Zhiyao Sun; Zixiang Zhou; Youliang Zhang; Zhengguang Zhou; Zhaoxin Fan; Hongyan Liu; Yuan Zhou; Qinglin Lu; Jun He

arXiv:2512.19546 · cs.CV · January 21, 2026

TL;DR

ActAvatar is a framework for text-guided talking avatars that achieves precise, temporally aligned action control through phase-aware attention and hierarchical audio-visual alignment, and it outperforms existing methods in action control accuracy.

Contribution

The paper introduces phase-aware cross-attention, hierarchical audio-visual alignment, and a two-stage training strategy to improve action control and lip synchronization.

Findings

01. Outperforms state-of-the-art methods in action control accuracy

02. Enhances lip synchronization quality

03. Maintains strong text-following capabilities

Abstract

Despite significant advances in talking avatar generation, existing methods face critical challenges: insufficient text-following capability for diverse actions, lack of temporal alignment between actions and audio content, and dependency on additional control signals such as pose skeletons. We present ActAvatar, a framework that achieves phase-level precision in action control through textual guidance by capturing both action semantics and temporal context. Our approach introduces three core innovations: (1) Phase-Aware Cross-Attention (PACA), which decomposes prompts into a global base block and temporally-anchored phase blocks, enabling the model to concentrate on phase-relevant tokens for precise temporal-semantic alignment; (2) Progressive Audio-Visual Alignment, which aligns modality influence with the hierarchical feature learning process: early layers prioritize text for…
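
The prompt decomposition described for PACA can be illustrated with a small sketch. This is not the authors' implementation: it assumes, purely for illustration, that "concentrating on phase-relevant tokens" is realized as a hard boolean mask in which every video frame may attend to the global base block and only to the phase block whose frame range contains it. The names `PhaseBlock` and `build_phase_mask`, the frame-index ranges, and the token counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PhaseBlock:
    """Hypothetical phase block: a prompt segment anchored to a frame range."""
    start_frame: int  # inclusive
    end_frame: int    # exclusive
    num_tokens: int   # number of prompt tokens in this block


def build_phase_mask(num_frames, base_tokens, phases):
    """Return mask[f][t] == True if frame f may attend to prompt token t.

    Token layout (an assumption): base block first, then phase blocks in order.
    """
    total_tokens = base_tokens + sum(p.num_tokens for p in phases)
    mask = []
    for frame in range(num_frames):
        # The global base block is visible to every frame.
        row = [t < base_tokens for t in range(total_tokens)]
        offset = base_tokens
        for phase in phases:
            # A phase block is visible only to frames inside its range.
            if phase.start_frame <= frame < phase.end_frame:
                for t in range(offset, offset + phase.num_tokens):
                    row[t] = True
            offset += phase.num_tokens
        mask.append(row)
    return mask


# Example: 4 frames, 1 base token, two phases covering frames 0-1 and 2-3.
phases = [PhaseBlock(0, 2, 2), PhaseBlock(2, 4, 1)]
mask = build_phase_mask(num_frames=4, base_tokens=1, phases=phases)
```

A mask of this shape could be passed as the attention mask of a cross-attention layer, with frames as queries and prompt tokens as keys. A soft bias toward phase tokens rather than a hard mask would be an equally plausible reading of the abstract.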

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Motion and Animation · Generative Adversarial Networks and Image Synthesis