TL;DR
UNCOM is a hybrid framework enabling zero-shot, context-aware interpretation of natural commands for robots in tabletop scenarios, integrating speech, gestures, and scene context.
Contribution
It introduces a modular, explainable system that operates without task-specific training data, combining multiple modalities for robust human-robot interaction.
Findings
Achieved 82.39% success rate on real-world interaction data
Demonstrated robustness to noise, diversity, and ambiguity
Provided publicly available dataset and code for future research
Abstract
This paper presents UNCOM, a novel hybrid framework for interpreting natural human commands in tabletop scenarios. The system integrates multiple sources of information -- speech, gestures, and scene context -- to extract structured, actionable instructions for robots. Addressing the need for general-purpose human-robot interaction in domestic environments, UNCOM is designed for zero-shot operation, without reliance on predefined object models or training data specific to a given task. Using foundational and task-specific deep learning models, it allows out-of-the-box speech recognition, natural language understanding, gesture detection, and object segmentation. The modular architecture enhances transparency and explainability by explicitly parsing commands into object-action-target representations, enabling integration with symbolic robotic frameworks. We demonstrate the system in a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
