A$^2$-LLM: An End-to-end Conversational Audio Avatar Large Language Model
Xiaolin Hu, Hang Yuan, Xinzhu Sang, Binbin Yan, Zhou Yu, Cong Huang, Kai Chen

TL;DR
A$^2$-LLM is an end-to-end multimodal model that enhances conversational digital humans by jointly reasoning about language, audio, and facial expressions, achieving real-time, emotionally expressive interactions.
Contribution
It introduces A$^2$-LLM, a unified framework for conversational avatars that integrates language, audio prosody, and facial motion, along with the FLAME-QA dataset for training.
Findings
Achieves real-time performance, with 500 ms latency and a real-time factor (RTF) of 0.7.
Generates emotionally rich facial expressions beyond lip-sync.
Outperforms cascaded systems in emotional expressiveness.
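To make the RTF figure concrete, here is a minimal sketch of how a real-time factor is commonly computed. It assumes the usual convention (processing time divided by the duration of the generated audio); the summary does not state the paper's exact measurement protocol. The function names and the 7 s / 10 s timing values are hypothetical and chosen only for illustration.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration.

    Assumption: this is the common RTF convention, not necessarily
    the exact protocol used by the paper's authors.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def is_faster_than_real_time(rtf: float) -> bool:
    """An RTF below 1.0 means output is produced faster than it plays back."""
    return rtf < 1.0


# Hypothetical timings: 7 s of compute to produce 10 s of speech and motion.
example_rtf = real_time_factor(processing_seconds=7.0, audio_seconds=10.0)
print(f"RTF = {example_rtf:.2f}, "
      f"faster than real time: {is_faster_than_real_time(example_rtf)}")
```

Under this convention, an RTF of 0.7 means each second of output takes roughly 0.7 seconds to generate, which leaves headroom for streaming playback. Latency is a separate quantity: the delay before the first output arrives.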
Abstract
Developing expressive and responsive conversational digital humans is a cornerstone of next-generation human-computer interaction. While large language models (LLMs) have significantly enhanced dialogue capabilities, most current systems still rely on cascaded architectures that connect independent modules. These pipelines are often plagued by accumulated errors, high latency, and poor real-time performance. Lacking access to the underlying conversational context, these pipelines inherently prioritize rigid lip-sync over emotional depth. To address these challenges, we propose A$^2$-LLM, an end-to-end conversational audio avatar large language model that jointly reasons about language, audio prosody, and 3D facial motion within a unified framework. To facilitate training, we introduce FLAME-QA, a high-quality multimodal dataset designed to align semantic intent with expressive facial…
Taxonomy
Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis
