DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation

Haozhe Xie; Beichen Wen; Jiarui Zheng; Zhaoxi Chen; Fangzhou Hong; Haiwen Diao; Ziwei Liu

arXiv:2601.22153·cs.RO·January 30, 2026



TL;DR

DynamicVLA is a framework for dynamic object manipulation that combines temporal reasoning, continuous inference that overlaps reasoning with execution, and an efficient convolutional vision encoder. It is supported by a new large-scale benchmark for training and evaluation.

Contribution

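The paper introduces DynamicVLA, a compact 0.4B vision-language-action model with two inference strategies (Continuous Inference and Latent-aware Action Streaming) and a convolutional vision encoder for efficient perception, along with the DOM benchmark for dynamic manipulation tasks.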

Findings

1) Significantly faster response and adaptation to moving objects.

2) Enhanced perception and generalization in dynamic scenarios.

3) A large-scale synthetic and real-world dataset for training and evaluation.

Abstract

Manipulating dynamic objects remains an open challenge for Vision-Language-Action (VLA) models, which, despite strong generalization in static manipulation, struggle in dynamic scenarios requiring rapid perception, temporal anticipation, and continuous control. We present DynamicVLA, a framework for dynamic object manipulation that integrates temporal reasoning and closed-loop adaptation through three key designs: 1) a compact 0.4B VLA using a convolutional vision encoder for spatially efficient, structurally faithful encoding, enabling fast multimodal inference; 2) Continuous Inference, enabling overlapping reasoning and execution for lower latency and timely adaptation to object motion; and 3) Latent-aware Action Streaming, which bridges the perception-execution gap by enforcing temporally aligned action execution. To fill the missing foundation of dynamic manipulation data, we…
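The abstract names Continuous Inference and Latent-aware Action Streaming but does not spell out their mechanics. The following is a minimal sketch of one way the two ideas could fit together, not the authors' implementation. It assumes a fixed inference latency measured in control steps, assumes each predicted action chunk is indexed from the step at which its observation was captured, and uses a hypothetical function name (`simulate_streaming`). A new inference starts as soon as the previous one finishes, and the controller skips the actions whose time slots elapsed while the model was still computing.

```python
from typing import List, Optional, Tuple

# Each log entry is (control_step, obs_step_of_chunk, index_into_chunk).
# obs_step_of_chunk and index_into_chunk are None while no chunk is ready.
LogEntry = Tuple[int, Optional[int], Optional[int]]


def simulate_streaming(total_steps: int, chunk_len: int, latency: int) -> List[LogEntry]:
    """Simulate overlapped inference and temporally aligned execution.

    Assumptions (illustrative, not taken from the paper):
      * inference takes exactly `latency` control steps;
      * action i of a chunk is intended for step obs_step + i;
      * a new inference starts the moment the previous one finishes.
    """
    if latency < 1 or 2 * latency > chunk_len:
        # A chunk must cover its own latency plus the next inference's latency.
        raise ValueError("need 1 <= latency and 2 * latency <= chunk_len")

    log: List[LogEntry] = []
    active_obs: Optional[int] = None  # observation step of the chunk being executed
    pending_obs = 0                   # observation step of the inference in flight

    for t in range(total_steps):
        # The in-flight inference completes `latency` steps after its observation.
        if t == pending_obs + latency:
            active_obs = pending_obs  # switch to the fresher chunk
            pending_obs = t           # overlap: start the next inference right away

        if active_obs is None:
            log.append((t, None, None))  # warm-up: nothing to execute yet
        else:
            # Temporal alignment: pick the action meant for *this* step, which
            # implicitly discards the first `latency` (stale) actions of the chunk.
            log.append((t, active_obs, t - active_obs))
    return log


if __name__ == "__main__":
    for entry in simulate_streaming(total_steps=7, chunk_len=4, latency=2):
        print(entry)
```

In this toy model the robot never executes an action planned for a step that has already passed, and it never waits for a chunk to run out before requesting the next one. Real systems would have variable latency and would need a policy for what to do during warm-up.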


Code & Models

Models

Datasets

hzxie/DOM
dataset · 33 downloads


Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Action Observation and Synchronization