ARIG: Autoregressive Interactive Head Generation for Real-time Conversations
Ying Guo, Xi Liu, Cheng Zhen, Pengfei Yan, Xiaoming Wei

TL;DR
ARIG introduces an autoregressive, frame-wise framework for real-time, realistic virtual agent head generation, leveraging continuous motion prediction and contextual understanding for improved interaction quality.
Contribution
It proposes a novel autoregressive, diffusion-based motion prediction framework with enhanced behavioral and conversational understanding for real-time virtual agent interactions.
Findings
Achieves more accurate, continuous motion predictions.
Enhances interaction realism through behavioral understanding.
Demonstrates effectiveness in real-time conversational scenarios.
Abstract
Face-to-face communication is a common human activity and motivates research on interactive head generation. A virtual agent with both listening and speaking capabilities can generate motion responses based on the audio or motion signals of the other user and of itself. However, previous clip-wise generation paradigms and explicit listener/speaker generator-switching methods are limited in future signal acquisition, contextual behavioral understanding, and switching smoothness, which makes real-time, realistic generation challenging. In this paper, we propose ARIG, an autoregressive (AR) frame-wise framework that achieves real-time generation with better interaction realism. To achieve real-time generation, we model motion prediction as a non-vector-quantized AR process. Unlike discrete codebook-index prediction, we represent the motion distribution using a diffusion procedure,…
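The abstract is truncated here, so the following is only a toy sketch of the general idea it names: frame-wise autoregressive generation in which each continuous motion vector is drawn by a short diffusion-style reverse process rather than picked from a discrete codebook. It is not ARIG's implementation. Every name and size in it (`MOTION_DIM`, `AUDIO_DIM`, `CONTEXT_FRAMES`, `NUM_STEPS`, `make_toy_denoiser`, `sample_frame`, `generate`) is hypothetical, the noise predictor is an untrained random map standing in for a learned network, and the context is simplified to one audio feature per frame plus the mean of recent frames.

```python
import numpy as np

MOTION_DIM = 6      # hypothetical size of one frame's motion vector
AUDIO_DIM = 4       # hypothetical size of one frame's audio feature
CONTEXT_FRAMES = 3  # hypothetical number of past frames used as context
NUM_STEPS = 8       # hypothetical number of reverse diffusion steps

# Standard DDPM-style noise schedule (values chosen only for illustration).
betas = np.linspace(1e-4, 0.2, NUM_STEPS)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)


def make_toy_denoiser(rng):
    """Return an untrained stand-in for a learned noise predictor."""
    in_dim = MOTION_DIM + (MOTION_DIM + AUDIO_DIM) + 1
    weights = rng.normal(scale=0.1, size=(MOTION_DIM, in_dim))

    def predict_noise(x, step, context):
        # Input: noisy sample, conditioning context, normalized step index.
        features = np.concatenate([x, context, [step / NUM_STEPS]])
        return np.tanh(weights @ features)

    return predict_noise


def sample_frame(predict_noise, context, rng):
    """Draw one continuous motion vector by ancestral (reverse) sampling."""
    x = rng.normal(size=MOTION_DIM)  # start from pure Gaussian noise
    for k in reversed(range(NUM_STEPS)):
        eps = predict_noise(x, k, context)
        mean = (x - betas[k] / np.sqrt(1.0 - alpha_bars[k]) * eps) / np.sqrt(alphas[k])
        # No noise is added on the final step.
        noise = rng.normal(size=MOTION_DIM) if k > 0 else np.zeros(MOTION_DIM)
        x = mean + np.sqrt(betas[k]) * noise
    return x


def generate(audio_features, seed=0):
    """Generate one motion frame per audio frame, autoregressively."""
    rng = np.random.default_rng(seed)
    predict_noise = make_toy_denoiser(rng)
    frames = [np.zeros(MOTION_DIM)]  # neutral starting pose (assumption)
    for audio in audio_features:
        # Condition on recent generated frames and the current audio frame only;
        # no future signal is needed, which is what permits streaming use.
        past = np.mean(frames[-CONTEXT_FRAMES:], axis=0)
        context = np.concatenate([past, audio])
        frames.append(sample_frame(predict_noise, context, rng))
    return np.stack(frames[1:])


motion = generate(np.zeros((5, AUDIO_DIM)))
print(motion.shape)  # (5, 6)
```

Because each frame depends only on already-generated frames and the current input, the loop can emit output frame by frame. The contrast with a vector-quantized approach is that `sample_frame` returns a real-valued vector directly, with no codebook lookup.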
Taxonomy
Topics: Social Robot Interaction and HRI · Multimodal Machine Learning Applications · Emotion and Mood Recognition
