Mobile UMI: Cross-View Diffusion Policy with Decoupled Kinematics for Mobile Manipulation
Haoran Huang, Haonan Dong, Huixu Dong

TL;DR
Mobile UMI introduces a cross-view diffusion policy with decoupled kinematics, leveraging dual-camera capture and online state matching to improve mobile manipulation success rates without architectural policy changes.
Contribution
The paper presents a hardware-free demonstration framework that combines dual-camera capture, spatial anchoring, and asynchronous execution to address locomotion-contaminated action labels and inference latency in mobile imitation learning.
Findings
Achieves an 83.8% success rate on household tasks.
Chest-centric global context improves policy performance.
Online state matching effectively compensates for inference latency.
Abstract
Mobile imitation learning on portable demonstration interfaces faces two coupled bottlenecks: locomotion-contaminated action labels and inference-induced execution latency on a continuously moving base. Recent wrist-mounted interfaces lower the cost of tabletop data collection, yet a single wrist view does not capture the global context required for base navigation. Adding a body-mounted camera entangles human walking with hand motion. Meanwhile, generative policies introduce hundreds of milliseconds of inference latency, during which the base advances past predicted waypoints, forcing backward corrections at action splices. This paper presents Mobile UMI, a hardware-free demonstration framework that addresses both gaps through three components. First, a dual-camera capture system records chest-centric global context and wrist-centric local interaction without any robot present. Second,…
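
The latency problem described above can be illustrated with a small sketch. The abstract is cut off before it explains how online state matching works, so the code below is not the authors' method. It shows one plausible reading, nearest-waypoint matching: when a new action chunk arrives, execution starts from the waypoint closest to the base's current position rather than from the first one. The function names, the 2D waypoint representation, and all numbers (speed, latency, waypoint spacing) are assumptions made for illustration.

```python
import math

def match_start_index(chunk, current_xy):
    """Return the index of the waypoint in `chunk` closest to `current_xy`.

    Hypothetical helper: nearest-neighbor matching in the plane is an
    assumption, not something the source specifies.
    """
    dists = [math.hypot(x - current_xy[0], y - current_xy[1]) for x, y in chunk]
    return min(range(len(chunk)), key=dists.__getitem__)

def splice(chunk, current_xy):
    """Drop the waypoints that lie before the matched one, keep the rest."""
    return chunk[match_start_index(chunk, current_xy):]

# Assumed numbers: the base moves at 0.5 m/s along x and inference takes 0.3 s.
# Waypoints are spaced 0.05 m apart and were predicted from the pose at
# observation time, so the first few are already behind the base on arrival.
chunk = [(0.05 * i, 0.0) for i in range(16)]
current_xy = (0.5 * 0.3, 0.0)  # base has advanced about 0.15 m during inference

remaining = splice(chunk, current_xy)
print(match_start_index(chunk, current_xy), len(remaining))
```

Without the matching step, a controller that began at the first waypoint would command the base back toward a position it has already passed, which is the kind of backward correction at action splices that the abstract describes.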
