VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting

Xiaoyu Liu; Chaoyou Fu; Chi Yan; Chu Wu; Haihan Gao; Yi-Fan Zhang; Shaoqi Dong; Cheng Qian; Bin Luo; Xiuyong Yang; Guanwu Li; Yusheng Cai; Yunhang Shen; Deqiang Jiang; Haoyu Cao; Xing Sun; Caifeng Shan; and Ran He

arXiv:2510.21817·cs.RO·October 28, 2025


TL;DR

VITA-E introduces a dual-model framework enabling embodied agents to see, hear, speak, and act concurrently with real-time interruption handling, advancing natural human-like multitasking in interactive scenarios.

Contribution

The paper presents a novel dual-model architecture and model-as-controller paradigm for concurrent, interruptible embodied interaction, improving responsiveness and multitasking capabilities.

Findings

01. High success rate in emergency stops and speech interruptions

02. Effective concurrent speech and action execution

03. Reliable handling of complex interactive scenarios

Abstract

Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm, which lacks the ability to see, hear, speak, and act concurrently as well as handle real-time user interruptions dynamically. This hinders seamless embodied collaboration, resulting in an inflexible and unresponsive user experience. To address these limitations, we introduce VITA-E, a novel embodied interaction framework designed for both behavioral concurrency and nearly real-time interruption. The core of our approach is a dual-model architecture where two parallel VLA instances operate as an "Active Model" and a "Standby Model", allowing the embodied agent to observe its environment, listen to user speech, provide verbal responses, and execute actions, all concurrently and interruptibly, mimicking human-like multitasking capabilities. We further propose a…
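The Active/Standby arrangement described in the abstract can be illustrated with a toy sketch. This is not the VITA-E implementation: the class names `ModelInstance` and `DualModelAgent`, the `tick` method, and the rule that the two instances swap roles whenever new user speech arrives are all assumptions made for illustration. The abstract only states that two parallel VLA instances operate as an Active Model and a Standby Model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ModelInstance:
    """Hypothetical stand-in for one VLA instance (not the VITA-E API)."""
    name: str
    task: Optional[str] = None
    log: List[str] = field(default_factory=list)

    def step(self) -> None:
        # Perform one tick of work on the current task, if there is one.
        if self.task is not None:
            self.log.append(f"{self.name} acts on: {self.task}")


class DualModelAgent:
    """Toy controller holding one active and one standby instance."""

    def __init__(self) -> None:
        self.active = ModelInstance("model_a")
        self.standby = ModelInstance("model_b")

    def tick(self, user_speech: Optional[str] = None) -> None:
        if user_speech is not None:
            # Assumed policy: the standby instance takes the new request,
            # the active instance is preempted, and the roles swap.
            self.active.task = None
            self.standby.task = user_speech
            self.active, self.standby = self.standby, self.active
        self.active.step()


agent = DualModelAgent()
agent.tick("pick up the cup")       # first request goes to the standby instance
agent.tick()                        # work continues uninterrupted
agent.tick("stop and put it down")  # interruption: roles swap again
```

In this sketch the instance that was executing is cleared the moment new speech arrives, so an interruption never has to wait for the running task to finish. A real system would run the instances in parallel and would need its own rules for when an utterance counts as an interruption.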
