ARIG: Autoregressive Interactive Head Generation for Real-time Conversations

Ying Guo; Xi Liu; Cheng Zhen; Pengfei Yan; Xiaoming Wei

arXiv:2507.00472 · cs.CV · July 2, 2025

Open Access

TL;DR

ARIG is an autoregressive, frame-wise framework for real-time interactive head generation for virtual agents. It predicts motion in a continuous (non-vector-quantized) space and uses conversational context to improve interaction realism.

Contribution

The paper proposes an autoregressive motion prediction framework that represents the motion distribution with a diffusion procedure and adds behavioral and conversational understanding for real-time virtual agent interactions.

Findings

01. Achieves more accurate, continuous motion predictions.

02. Enhances interaction realism through behavioral understanding.

03. Demonstrates effectiveness in real-time conversational scenarios.

Abstract

Face-to-face communication is a common human activity and motivates research on interactive head generation. A virtual agent with both listening and speaking capabilities can generate motion responses based on the audio or motion signals of the other user and of itself. However, the previous clip-wise generation paradigm and explicit listener/speaker generator-switching methods are limited in future signal acquisition, contextual behavioral understanding, and switching smoothness, which makes real-time, realistic generation challenging. In this paper, we propose an autoregressive (AR) frame-wise framework called ARIG to achieve real-time generation with better interaction realism. To achieve real-time generation, we model motion prediction as a non-vector-quantized AR process. Unlike discrete codebook-index prediction, we represent the motion distribution using a diffusion procedure,…
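As a rough illustration of the frame-wise autoregressive idea described in the abstract, the following Python sketch produces one continuous motion vector per incoming frame, conditioning each step on the previously generated motion and the current audio feature. It is a toy under stated assumptions, not the authors' implementation: `toy_denoiser`, `make_context`, `MOTION_DIM`, and `NUM_STEPS` are hypothetical names, the denoiser is a fixed placeholder rather than a learned network, and the simple refinement loop only stands in for a real diffusion sampler.

```python
import random

MOTION_DIM = 4   # hypothetical size of a per-frame motion vector
NUM_STEPS = 8    # hypothetical number of refinement steps per frame


def toy_denoiser(noisy, context, step):
    """Placeholder for a learned network (assumption): it simply
    returns the context vector as its estimate of the clean motion."""
    return list(context)


def make_context(history, audio_frame):
    """Toy conditioning (assumption): average the last generated motion
    (zeros if there is none yet) with the current audio feature."""
    prev = history[-1] if history else [0.0] * MOTION_DIM
    return [0.5 * p + 0.5 * a for p, a in zip(prev, audio_frame)]


def sample_frame(context, rng):
    """Start from Gaussian noise and refine it step by step toward the
    denoiser's estimate. The output is a continuous vector, with no
    codebook lookup or discrete index involved."""
    x = [rng.gauss(0.0, 1.0) for _ in range(MOTION_DIM)]
    for step in range(NUM_STEPS, 0, -1):
        estimate = toy_denoiser(x, context, step)
        alpha = 1.0 / step  # the last step (step == 1) lands on the estimate
        x = [xi + alpha * (ei - xi) for xi, ei in zip(x, estimate)]
    return x


def generate(audio_stream, seed=0):
    """Frame-wise autoregressive loop: each new frame is conditioned only
    on past outputs and the current input, never on future frames."""
    rng = random.Random(seed)
    history = []
    for audio_frame in audio_stream:
        context = make_context(history, audio_frame)
        history.append(sample_frame(context, rng))
    return history


if __name__ == "__main__":
    frames = generate([[1.0] * MOTION_DIM for _ in range(3)])
    for i, frame in enumerate(frames):
        print(i, [round(v, 3) for v in frame])
```

Because each frame depends only on earlier outputs and the current input, a loop of this shape does not need future signals, which is the limitation of clip-wise generation that the abstract points to.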

Taxonomy

Topics: Social Robot Interaction and HRI · Multimodal Machine Learning Applications · Emotion and Mood Recognition