U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation
Xiang Deng, Feng Gao, Yong Zhang, Youxin Pang, Xu Xiaoming, Zhuoliang Kang, Xiaoming Wei, Yebin Liu

TL;DR
U-Mind is a comprehensive real-time multimodal interaction system that integrates language, speech, motion, and video synthesis, enabling natural and synchronized communication for intelligent agents.
Contribution
U-Mind introduces a Unified Alignment and Reasoning Framework that combines a segment-wise alignment strategy with Rehearsal-Driven Learning to support high-quality, real-time multimodal dialogue and interaction.
Findings
Achieves state-of-the-art results on multimodal interaction tasks
Demonstrates effective cross-modal synchronization and reasoning
Enables expressive, synchronized visual feedback in real time
Abstract
Full-stack multimodal interaction in real time is a central goal in building intelligent embodied agents capable of natural, dynamic communication. However, existing systems are either limited to unimodal generation or suffer from degraded reasoning and poor cross-modal alignment, preventing coherent and perceptually grounded interactions. In this work, we introduce U-Mind, the first unified system for high-intelligence multimodal dialogue that supports real-time generation and jointly models language, speech, motion, and video synthesis within a single interactive loop. At its core, U-Mind implements a Unified Alignment and Reasoning Framework that addresses two key challenges: enhancing cross-modal synchronization via a segment-wise alignment strategy, and preserving reasoning abilities through Rehearsal-Driven Learning. During inference, U-Mind adopts a text-first decoding pipeline…
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Generative Adversarial Networks and Image Synthesis
