OmniUMI: Towards Physically Grounded Robot Learning via Human-Aligned Multimodal Interaction

Shaqi Luo; Yuanyuan Li; Youhao Hu; Chenhao Yu; Chaoran Xu; Jiachen Zhang; Guocai Yao; Tiejun Huang; Ran He; Zhongyuan Wang

arXiv:2604.10647·cs.RO·May 6, 2026

TL;DR

OmniUMI is a multimodal robot learning framework that captures visual, tactile, and force data with a handheld system, so that contact-rich manipulation can be learned from human-aligned interaction.

Contribution

The paper presents OmniUMI, a unified system for physically grounded robot learning using multimodal sensing and natural human interaction, extending diffusion policies for contact-rich tasks.

Findings

01. Reliable sensing of contact and force signals demonstrated.

02. Strong performance in force-sensitive manipulation tasks.

03. Unified framework improves contact-rich manipulation capabilities.

Abstract

UMI-style interfaces enable scalable robot learning, but existing systems remain largely visuomotor, relying primarily on RGB observations and trajectory while providing only limited access to physical interaction signals. This becomes a fundamental limitation in contact-rich manipulation, where success depends on contact dynamics such as tactile interaction, internal grasping force, and external interaction wrench that are difficult to infer from vision alone. We present OmniUMI, a unified framework for physically grounded robot learning via human-aligned multimodal interaction. OmniUMI synchronously captures RGB, depth, trajectory, tactile sensing, internal grasping force, and external interaction wrench within a compact handheld system, while maintaining collection–deployment consistency through a shared embodiment design. To support human-aligned demonstration, OmniUMI enables…
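The abstract does not say how the streams are synchronized. As an illustration only, here is a minimal Python sketch of one common approach: nearest-timestamp alignment of several sensor streams against a reference stream. The stream names, the `align` and `nearest` helpers, and the tolerance parameter are all hypothetical and are not taken from the paper.

```python
from bisect import bisect_left
from typing import Any, Dict, List, Sequence, Tuple

# Assumption: each stream is a non-empty list of (timestamp_seconds, payload)
# pairs, sorted by timestamp. This layout is illustrative, not from the paper.
Stream = Sequence[Tuple[float, Any]]


def nearest(stream: Stream, t: float) -> Tuple[float, Any]:
    """Return the sample in `stream` whose timestamp is closest to `t`."""
    times = [ts for ts, _ in stream]
    i = bisect_left(times, t)
    # The closest sample is either just before or just at/after position i.
    candidates = stream[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda sample: abs(sample[0] - t))


def align(streams: Dict[str, Stream], reference: str,
          tolerance: float) -> List[Dict[str, Any]]:
    """Build one record per reference sample, pairing it with the nearest
    sample of every other stream. A record is dropped if any stream has
    no sample within `tolerance` seconds of the reference timestamp."""
    frames: List[Dict[str, Any]] = []
    for t, ref_payload in streams[reference]:
        frame: Dict[str, Any] = {"t": t, reference: ref_payload}
        complete = True
        for name, stream in streams.items():
            if name == reference:
                continue
            ts, payload = nearest(stream, t)
            if abs(ts - t) > tolerance:
                complete = False
                break
            frame[name] = payload
        if complete:
            frames.append(frame)
    return frames
```

Calling `align(streams, "rgb", 0.05)` on a dictionary with hypothetical `"rgb"` and `"wrench"` streams would keep only the RGB frames that have a wrench reading within 50 ms. Dropping incomplete frames rather than interpolating is a design choice made here for simplicity.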
