MIDAS: Multimodal Interactive Digital-humAn Synthesis via Real-time Autoregressive Video Generation
Ming Chen, Liyuan Cui, Wenyuan Zhang, Haoxian Zhang, Yan Zhou, Xiaohan Li, Songlin Tang, Jiwen Liu, Borui Liao, Hejia Chen, Xiaoqiang Liu, Pengfei Wan

TL;DR
This paper presents MIDAS, a real-time framework for multimodal interactive digital human video generation. It combines autoregressive modeling with multimodal control to produce efficient, controllable, and coherent video synthesis in conversational scenarios.
Contribution
MIDAS introduces a novel autoregressive video generation framework that integrates multimodal inputs with minimal LLM modifications and employs a deep autoencoder for efficiency, advancing real-time interactive digital human synthesis.
Findings
Supports low-latency, high-efficiency video generation
Enables fine-grained multimodal control in synthesis
Demonstrates effectiveness in multilingual and conversational scenarios
Abstract
Recently, interactive digital human video generation has attracted widespread attention and achieved remarkable progress. However, building a practical system that can interact with diverse input signals in real time remains challenging for existing methods, which often struggle with heavy computational cost and limited controllability. In this work, we introduce an autoregressive video generation framework that enables interactive multimodal control and low-latency extrapolation in a streaming manner. With minimal modifications to a standard large language model (LLM), our framework accepts multimodal condition encodings including audio, pose, and text, and outputs spatially and semantically coherent representations to guide the denoising process of a diffusion head. To support this, we construct a large-scale dialogue dataset of approximately 20,000 hours from multiple sources,…
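The control flow described in the abstract (per-chunk condition encodings feed an autoregressive backbone, whose output guides a diffusion head, one chunk at a time) can be illustrated with a toy sketch. This is not the authors' implementation: every name below (`encode_conditions`, `backbone_step`, `diffusion_head`, `stream`) is hypothetical, and simple arithmetic stands in for the LLM and the denoiser. It shows only the streaming loop structure, under those assumptions.

```python
import random
from typing import Dict, Iterable, Iterator, List

Vector = List[float]


def encode_conditions(audio: Vector, pose: Vector, text: Vector) -> Vector:
    """Hypothetical fusion step: concatenate the per-chunk condition encodings."""
    return audio + pose + text


def backbone_step(history: List[Vector], cond: Vector, dim: int) -> Vector:
    """Toy stand-in for the autoregressive LLM (an assumption, not the real model).

    Mixes the previously generated latent with the mean of the condition vector
    to produce a guidance representation for the current chunk.
    """
    prev = history[-1] if history else [0.0] * dim
    c = sum(cond) / len(cond)
    return [0.5 * p + 0.5 * c for p in prev]


def diffusion_head(guide: Vector, steps: int, rng: random.Random) -> Vector:
    """Toy stand-in for the diffusion head: start from Gaussian noise and
    move halfway toward the guidance vector at every step."""
    x = [rng.gauss(0.0, 1.0) for _ in guide]
    for _ in range(steps):
        x = [xi + 0.5 * (gi - xi) for xi, gi in zip(x, guide)]
    return x


def stream(chunks: Iterable[Dict[str, Vector]], dim: int = 4,
           steps: int = 8, seed: int = 0) -> Iterator[Vector]:
    """Yield one latent per incoming chunk, conditioning on past outputs."""
    rng = random.Random(seed)
    history: List[Vector] = []
    for chunk in chunks:
        cond = encode_conditions(chunk["audio"], chunk["pose"], chunk["text"])
        guide = backbone_step(history, cond, dim)
        latent = diffusion_head(guide, steps, rng)
        history.append(latent)  # autoregression: the next chunk sees this output
        yield latent
```

Because `stream` is a generator, each latent is available as soon as its chunk has been processed, which mirrors the streaming, low-latency behaviour the abstract describes. In a real system each latent would then be decoded into frames by the autoencoder mentioned in the Contribution section.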
