A$^2$-LLM: An End-to-end Conversational Audio Avatar Large Language Model

Xiaolin Hu; Hang Yuan; Xinzhu Sang; Binbin Yan; Zhou Yu; Cong Huang; Kai Chen

arXiv:2602.04913 · cs.LG · February 6, 2026

Open Access

TL;DR

A$^2$-LLM is an end-to-end multimodal model that enhances conversational digital humans by jointly reasoning about language, audio, and facial expressions, achieving real-time, emotionally expressive interactions.

Contribution

It introduces A$^2$-LLM, a unified framework for conversational avatars that integrates language, audio prosody, and facial motion, along with the FLAME-QA dataset for training.

Findings

01. Achieves real-time performance with 500 ms latency and a real-time factor (RTF) of 0.7.

02. Generates emotionally rich facial expressions beyond lip-sync.

03. Outperforms cascaded systems in emotional expressiveness.
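
For readers unfamiliar with the RTF figure in the findings, the sketch below uses the common definition of real-time factor: compute time divided by the duration of the generated audio, with values below 1.0 meaning generation outpaces playback. The listing does not state how the authors measured it, so the function names and the 7 s / 10 s timings are hypothetical and chosen only to reproduce a ratio of 0.7.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Ratio of compute time to the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def is_real_time(rtf: float) -> bool:
    """An RTF below 1.0 means output is produced faster than it plays back."""
    return rtf < 1.0


# Hypothetical timings (not from the paper): 7 s of compute for a 10 s response.
rtf = real_time_factor(7.0, 10.0)
print(f"RTF = {rtf:.2f}, real-time: {is_real_time(rtf)}")
```

Under this definition, latency (time until the first output) and RTF (sustained throughput) are separate quantities, which is presumably why the findings report both.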

Abstract

Developing expressive and responsive conversational digital humans is a cornerstone of next-generation human-computer interaction. While large language models (LLMs) have significantly enhanced dialogue capabilities, most current systems still rely on cascaded architectures that connect independent modules. These pipelines are often plagued by accumulated errors, high latency, and poor real-time performance. Lacking access to the underlying conversational context, these pipelines inherently prioritize rigid lip-sync over emotional depth. To address these challenges, we propose A$^2$-LLM, an end-to-end conversational audio avatar large language model that jointly reasons about language, audio prosody, and 3D facial motion within a unified framework. To facilitate training, we introduce FLAME-QA, a high-quality multimodal dataset designed to align semantic intent with expressive facial…

Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis