Towards Generalizable Robotic Manipulation in Dynamic Environments

Heng Fang; Shangru Li; Shuhan Wang; Xuanyang Xi; Dingkang Liang; Xiang Bai

arXiv:2603.15620·cs.CV·April 16, 2026

TL;DR

This paper introduces DOMINO, a large-scale dataset and benchmark for dynamic manipulation, and proposes PUMA, a dynamics-aware VLA architecture that improves generalization and success rates in dynamic environments.

Contribution

The work contributes DOMINO, a dataset and benchmark for dynamic manipulation with 35 tasks and over 110K expert trajectories, and PUMA, a dynamics-aware VLA architecture, extending vision-language-action models from static scenes to settings with moving targets.

Findings

01 PUMA achieves a 6.3% success rate improvement over baselines.

02 Training on dynamic data enhances transferability to static tasks.

03 DOMINO enables systematic evaluation of VLAs in dynamic environments.

Abstract

Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and…
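
To illustrate why single-frame observations limit reasoning about moving targets, here is a minimal Python sketch that keeps a short history of target positions and estimates velocity by finite differences. It is an illustration under stated assumptions, not PUMA's architecture or any part of the DOMINO code; the class name `ObservationHistory` and its methods are hypothetical.

```python
from collections import deque


class ObservationHistory:
    """Hypothetical fixed-length buffer of (time, position) observations."""

    def __init__(self, maxlen):
        # Old entries are dropped automatically once maxlen is reached.
        self._buffer = deque(maxlen=maxlen)

    def push(self, t, position):
        """Record the target position (a tuple of floats) seen at time t."""
        self._buffer.append((t, tuple(position)))

    def velocity(self):
        """Finite-difference velocity between the oldest and newest entries.

        Returns None when fewer than two observations are available, which is
        exactly the single-frame case: position is known, motion is not.
        """
        if len(self._buffer) < 2:
            return None
        t0, p0 = self._buffer[0]
        t1, p1 = self._buffer[-1]
        dt = t1 - t0
        if dt <= 0:
            return None
        return tuple((b - a) / dt for a, b in zip(p0, p1))


history = ObservationHistory(maxlen=3)
history.push(0.0, (0.0, 0.0))
single_frame_estimate = history.velocity()  # no motion estimate from one frame
history.push(0.5, (1.0, 0.0))
two_frame_estimate = history.velocity()     # displacement divided by elapsed time
```

With one observation the estimate is unavailable; a second observation is the minimum needed to recover any motion information, which is why some form of history is a prerequisite for dynamics-aware control.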

Code & Models

Repositories

H-EmbodVis/DOMINO
github

Models

H-EmbodVis/PUMA
model · 8 dl · ♡ 1

Datasets

H-EmbodVis/DOMINO
dataset · 2.7k dl
