U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation

Xiang Deng; Feng Gao; Yong Zhang; Youxin Pang; Xu Xiaoming; Zhuoliang Kang; Xiaoming Wei; Yebin Liu

arXiv:2602.23739 · cs.CV · March 2, 2026

TL;DR

U-Mind is a comprehensive real-time multimodal interaction system that integrates language, speech, motion, and video synthesis, enabling natural and synchronized communication for intelligent agents.

Contribution

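U-Mind introduces a unified framework that pairs a segment-wise alignment strategy for cross-modal synchronization with Rehearsal-Driven Learning to preserve reasoning ability, targeting high-quality, real-time multimodal dialogue and interaction.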

Findings

01. Achieves state-of-the-art results on multimodal interaction tasks

02. Demonstrates effective cross-modal synchronization and reasoning

03. Enables expressive, synchronized visual feedback in real-time

Abstract

Full-stack multimodal interaction in real-time is a central goal in building intelligent embodied agents capable of natural, dynamic communication. However, existing systems are either limited to unimodal generation or suffer from degraded reasoning and poor cross-modal alignment, preventing coherent and perceptually grounded interactions. In this work, we introduce U-Mind, the first unified system for high-intelligence multimodal dialogue that supports real-time generation and jointly models language, speech, motion, and video synthesis within a single interactive loop. At its core, U-Mind implements a Unified Alignment and Reasoning Framework that addresses two key challenges: enhancing cross-modal synchronization via a segment-wise alignment strategy, and preserving reasoning abilities through Rehearsal-Driven Learning. During inference, U-Mind adopts a text-first decoding pipeline…

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Generative Adversarial Networks and Image Synthesis