OmniUMI: Towards Physically Grounded Robot Learning via Human-Aligned Multimodal Interaction
Shaqi Luo, Yuanyuan Li, Youhao Hu, Chenhao Yu, Chaoran Xu, Jiachen Zhang, Guocai Yao, Tiejun Huang, Ran He, Zhongyuan Wang

TL;DR
OmniUMI introduces a multimodal robot learning framework that captures visual, tactile, and force data through a handheld system, enabling contact-rich manipulation tasks with human-aligned interaction.
Contribution
The paper presents OmniUMI, a unified system for physically grounded robot learning using multimodal sensing and natural human interaction, extending diffusion policies for contact-rich tasks.
Findings
The system reliably senses contact and force signals.
Strong performance is reported on force-sensitive manipulation tasks.
The unified framework improves contact-rich manipulation.
Abstract
UMI-style interfaces enable scalable robot learning, but existing systems remain largely visuomotor, relying primarily on RGB observations and trajectory while providing only limited access to physical interaction signals. This becomes a fundamental limitation in contact-rich manipulation, where success depends on contact dynamics such as tactile interaction, internal grasping force, and external interaction wrench that are difficult to infer from vision alone. We present OmniUMI, a unified framework for physically grounded robot learning via human-aligned multimodal interaction. OmniUMI synchronously captures RGB, depth, trajectory, tactile sensing, internal grasping force, and external interaction wrench within a compact handheld system, while maintaining collection-deployment consistency through a shared embodiment design. To support human-aligned demonstration, OmniUMI enables…
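As a rough illustration of what synchronous multimodal capture involves, the sketch below aligns several sensor streams to a reference stream by nearest timestamp. It is not the authors' implementation: the stream names, the `align_streams` helper, and the `max_skew` tolerance are all assumptions introduced here for illustration, and the abstract does not say how OmniUMI actually synchronizes its sensors.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

# One sample is (timestamp_in_seconds, payload). Payload types are left open.
Sample = Tuple[float, object]


@dataclass
class Frame:
    """One synchronized observation across all modalities (hypothetical schema)."""
    t: float
    data: Dict[str, object]


def nearest_index(timestamps: Sequence[float], t: float) -> int:
    """Return the index of the timestamp closest to t (timestamps must be sorted)."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer; ties go to the earlier sample.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align_streams(
    streams: Dict[str, List[Sample]],
    reference: str,
    max_skew: float,
) -> List[Frame]:
    """Build one Frame per reference sample, dropping frames where any
    other stream has no sample within max_skew seconds."""
    # Precompute sorted timestamp lists once per stream.
    stamps = {name: [s[0] for s in samples] for name, samples in streams.items()}
    frames: List[Frame] = []
    for t_ref, ref_payload in streams[reference]:
        data: Dict[str, object] = {reference: ref_payload}
        complete = True
        for name, samples in streams.items():
            if name == reference:
                continue
            if not samples:
                complete = False
                break
            j = nearest_index(stamps[name], t_ref)
            if abs(stamps[name][j] - t_ref) > max_skew:
                complete = False
                break
            data[name] = samples[j][1]
        if complete:
            frames.append(Frame(t=t_ref, data=data))
    return frames
```

In a sketch like this, the slowest stream (typically the camera) would serve as the reference, and faster streams such as force or wrench readings would be matched to each image timestamp. Interpolating between neighbouring samples is an alternative to nearest-neighbour matching for continuous signals.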
