MIDAS: Multimodal Interactive Digital-humAn Synthesis via Real-time Autoregressive Video Generation

Ming Chen; Liyuan Cui; Wenyuan Zhang; Haoxian Zhang; Yan Zhou; Xiaohan Li; Songlin Tang; Jiwen Liu; Borui Liao; Hejia Chen; Xiaoqiang Liu; Pengfei Wan

arXiv:2508.19320 · cs.CV · August 29, 2025

TL;DR

This paper presents MIDAS, a framework for real-time, multimodal interactive digital human video generation. It combines autoregressive modeling with multimodal control to produce efficient, controllable, and coherent video in conversational scenarios.

Contribution

MIDAS introduces an autoregressive video generation framework that accepts multimodal inputs with only minimal modifications to a standard LLM and uses a deep autoencoder for efficiency, targeting real-time interactive digital human synthesis.

Findings

01. Supports low-latency, high-efficiency video generation

02. Enables fine-grained multimodal control in synthesis

03. Demonstrates effectiveness in multilingual and conversational scenarios

Abstract

Interactive digital human video generation has recently attracted widespread attention and achieved remarkable progress. However, building a practical system that can interact with diverse input signals in real time remains challenging for existing methods, which often struggle with heavy computational cost and limited controllability. In this work, we introduce an autoregressive video generation framework that enables interactive multimodal control and low-latency extrapolation in a streaming manner. With minimal modifications to a standard large language model (LLM), our framework accepts multimodal condition encodings including audio, pose, and text, and outputs spatially and semantically coherent representations to guide the denoising process of a diffusion head. To support this, we construct a large-scale dialogue dataset of approximately 20,000 hours from multiple sources,…
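
The data flow described in the abstract (condition encodings in, an autoregressive backbone producing a guiding representation, a diffusion head denoising toward it, one chunk at a time) can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`fuse_conditions`, `backbone_step`, `diffusion_head`, `stream_frames`, `DIM`) is hypothetical, the averaging of conditions and the linear-interpolation "denoiser" are placeholders for learned networks, and the latent size is arbitrary.

```python
import random
from typing import Dict, Iterable, Iterator, List, Optional

Vector = List[float]
DIM = 4  # hypothetical latent size, chosen only for illustration


def fuse_conditions(cond: Dict[str, Optional[Vector]]) -> Vector:
    """Average whichever condition encodings (audio, pose, text) are present."""
    present = [v for v in cond.values() if v is not None]
    if not present:
        return [0.0] * DIM
    return [sum(vals) / len(present) for vals in zip(*present)]


def backbone_step(history: List[Vector], cond_vec: Vector) -> Vector:
    """Placeholder for the autoregressive LLM: mix the last latent with the conditions."""
    prev = history[-1] if history else [0.0] * DIM
    return [0.5 * p + 0.5 * c for p, c in zip(prev, cond_vec)]


def diffusion_head(guide: Vector, steps: int, rng: random.Random) -> Vector:
    """Placeholder denoiser: start from Gaussian noise and move toward the guide."""
    x = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    for _ in range(steps):
        x = [xi + 0.5 * (gi - xi) for xi, gi in zip(x, guide)]
    return x


def stream_frames(
    cond_stream: Iterable[Dict[str, Optional[Vector]]],
    steps: int = 8,
    seed: int = 0,
) -> Iterator[Vector]:
    """Yield one frame latent per incoming condition chunk, in streaming order."""
    rng = random.Random(seed)
    history: List[Vector] = []
    for cond in cond_stream:
        guide = backbone_step(history, fuse_conditions(cond))
        frame = diffusion_head(guide, steps, rng)
        history.append(frame)  # generated frames become context for the next step
        yield frame
```

Because `stream_frames` is a generator, each frame latent is available as soon as its condition chunk arrives, which is the property the abstract refers to as low-latency streaming extrapolation. A missing modality can be passed as `None` and is simply left out of the fused condition in this sketch.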
