RoboOmni: Proactive Robot Manipulation in Omni-modal Context
Siyin Wang, Jinlan Fu, Feihong Liu, Xinzhe He, Huangxuan Wu, Junhao Shi, Kexin Huang, Zhaoye Fei, Jingjing Gong, Zuxuan Wu, Yu-Gang Jiang, See-Kiong Ng, Tat-Seng Chua, Xipeng Qiu

TL;DR
RoboOmni introduces a proactive, omni-modal framework for robotic manipulation that infers user intentions from multimodal cues like speech, sounds, and visuals, enabling more natural human-robot collaboration.
Contribution
The paper presents RoboOmni, an end-to-end omni-modal LLM framework that unifies intention recognition, interaction confirmation, and action execution for robotic manipulation, addressing the lack of proactive intention inference in approaches that rely on explicit instructions.
Findings
RoboOmni outperforms baselines in success rate and inference speed.
It effectively fuses auditory and visual signals for intention recognition.
Experiments demonstrate robustness in simulation and real-world scenarios.
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have driven rapid progress in Vision-Language-Action (VLA) models for robotic manipulation. Although effective in many scenarios, current approaches largely rely on explicit instructions, whereas in real-world interactions, humans rarely issue instructions directly. Effective collaboration requires robots to infer user intentions proactively. In this work, we introduce cross-modal contextual instructions, a new setting where intent is derived from spoken dialogue, environmental sounds, and visual cues rather than explicit commands. To address this new setting, we present RoboOmni, a Perceiver-Thinker-Talker-Executor framework based on end-to-end omni-modal LLMs that unifies intention recognition, interaction confirmation, and action execution. RoboOmni fuses auditory and visual signals spatiotemporally for robust intention…
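The Perceiver-Thinker-Talker-Executor flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: RoboOmni is an end-to-end omni-modal LLM, whereas the code below swaps in hand-written rules and plain strings so the control flow (perceive cues, infer intent, confirm with the user, then act) is visible. Every name here (`Observation`, `Intent`, `perceive`, `think`, `talk`, `execute`, `step`, and the action strings) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Observation:
    """Hypothetical stand-ins for the three cue types named in the abstract."""
    speech: str                 # stand-in for spoken dialogue audio
    sound_events: List[str]     # stand-in for environmental sounds
    visible_objects: List[str]  # stand-in for camera frames


@dataclass
class Intent:
    action: str
    target: str


DRINKS = {"cola", "water", "juice"}  # assumed toy vocabulary


def perceive(obs: Observation) -> Dict[str, object]:
    """Perceiver: collect all modalities into one shared context."""
    return {
        "speech": obs.speech.lower(),
        "sounds": set(obs.sound_events),
        "objects": set(obs.visible_objects),
    }


def think(ctx: Dict[str, object]) -> Optional[Intent]:
    """Thinker: toy rules standing in for LLM reasoning over the context."""
    # Speech + vision: a remark about thirst and a visible drink.
    if "thirsty" in ctx["speech"]:
        for obj in sorted(ctx["objects"]):
            if obj in DRINKS:
                return Intent("pick_up", obj)
    # Sound + vision: a beep and a visible microwave.
    if "microwave_beep" in ctx["sounds"] and "microwave" in ctx["objects"]:
        return Intent("open", "microwave")
    return None  # no implicit request detected


def talk(intent: Intent) -> str:
    """Talker: ask for confirmation before acting."""
    verb = intent.action.replace("_", " ")
    return f"Would you like me to {verb} the {intent.target}?"


def execute(intent: Intent, confirmed: bool) -> List[str]:
    """Executor: emit placeholder action tokens only after confirmation."""
    if not confirmed:
        return []
    return [f"move_to:{intent.target}", f"{intent.action}:{intent.target}"]


def step(obs: Observation, user_confirms: bool) -> Tuple[Optional[str], List[str]]:
    """Run one perceive -> think -> talk -> execute cycle."""
    intent = think(perceive(obs))
    if intent is None:
        return None, []
    return talk(intent), execute(intent, user_confirms)


if __name__ == "__main__":
    obs = Observation("I'm so thirsty", ["kettle_whistle"], ["cola", "book"])
    question, actions = step(obs, user_confirms=True)
    print(question)
    print(actions)
```

The confirmation gate in `execute` mirrors the "interaction confirmation" stage: the robot proposes an inferred action and proceeds only if the user agrees. In the actual system these stages are unified inside one model rather than separated into functions.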
Peer Reviews
Decision·ICLR 2026 Poster
1. Introduces a novel and practical setting, cross-modal contextual instructions, reflecting more natural human-robot interaction without explicit commands.
2. Proposes RoboOmni, a unified framework that integrates intention recognition, interaction confirmation, and action execution using multimodal large language models.
3. Builds a large, diverse dataset (OmniAction) with rich multimodal signals to support training and evaluation.
4. Demonstrates strong empirical performance, outperforming
1. Generalization Beyond Scripted Contextual Cues: A primary concern is the potential for the model to overfit to the specific structures of the six contextual instruction types synthesized for the OmniAction dataset. Since the dataset was generated by prompting GPT-4o to convert atomic instructions into structured dialogues, the model may be learning to recognize these semi-scripted patterns rather than developing a more general, robust capability for open-world intent inference. The impressive
- Clear writing and well-structured presentation.
- Promised release of a robotic dataset, extensive experiments across diverse tasks, and large-scale training.
- An “omni” framework for cross-modal contextual instructions that explicitly includes environmental event/background sounds; comparisons to ASR-based pipelines highlight the benefits of an end-to-end model that can handle overlapping speech.
- Strong empirical results, with ≈60% average success rates exceeding other baselines.
- Limited discussion of the method and training for fusing multimodal sensory inputs.
- Ablations on modality contributions are missing: it remains unclear how much each cue (prosody/identity, non-verbal audio, vision) contributes. Please add drop-modality ablations (e.g., audio-w/o-prosody, no non-verbal, no vision) and alignment-window studies; current results separate instruction types but not modalities within the architecture.
- Task descriptions are brief; it is hard to understand the ch
- The scenarios are well-chosen, practically relevant, and aligned with next-generation robots operating in realistic human-robot interaction settings.
- The constructed dataset is likely to be valuable for the community, providing broad multimodal supervision that can stimulate research on proactive robotic behavior.
- Neither the dataset’s realism/relevance nor the model’s performance appears to be evaluated by humans. This limits claims about practical usability and interaction quality. While Section 5.3 covers direct human audio instructions, this differs from the complex multimodal contexts central to the paper’s claims.
- The dataset may inherit biases from off-the-shelf models used during curation. This paper does not analyze or mitigate such biases.
- The setup, tuning, and fairness of baseline comp
