Towards Generalizable Robotic Manipulation in Dynamic Environments
Heng Fang, Shangru Li, Shuhan Wang, Xuanyang Xi, Dingkang Liang, Xiang Bai

TL;DR
This paper introduces DOMINO, a large-scale dataset and benchmark for dynamic manipulation, and proposes PUMA, a dynamics-aware VLA architecture that improves generalization and success rates in dynamic environments.
Contribution
The work provides a new dataset and benchmark (DOMINO) and a novel architecture (PUMA) for dynamic manipulation, advancing the capabilities of vision-language-action models in dynamic settings.
Findings
PUMA improves the success rate by 6.3% over baselines.
Training on dynamic data enhances transferability to static tasks.
DOMINO enables systematic evaluation of VLAs in dynamic environments.
Abstract
Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and…
