Pri4R: Learning World Dynamics for Vision-Language-Action Models with Privileged 4D Representation

Jisoo Kim; Jungbin Cho; Sanghyeok Chu; Ananya Bal; Jinhyung Kim; Gunhee Lee; Sihaeng Lee; Seung Hwan Kim; Bohyung Han; Hyunmin Lee; Laszlo A. Jeni; Seungryong Kim

arXiv:2603.01549 · cs.CV · March 11, 2026

TL;DR

Pri4R enhances vision-language-action models by integrating privileged 4D world dynamics during training, leading to improved manipulation performance without increasing inference complexity.

Contribution

Introduces Pri4R, a simple method that incorporates 3D point track prediction to embed world dynamics into VLA models using privileged 4D information during training.

Findings

01. +10% on LIBERO-Long manipulation tasks

02. +40% on RoboCasa manipulation tasks

03. Effective supervision of action-world dynamics through 3D point tracking

Abstract

Humans learn not only how their bodies move, but also how the surrounding world responds to their actions. In contrast, while recent Vision-Language-Action (VLA) models exhibit impressive semantic understanding, they often fail to capture the spatiotemporal dynamics governing physical interaction. In this paper, we introduce Pri4R, a simple yet effective approach that endows VLA models with an implicit understanding of world dynamics by leveraging privileged 4D information during training. Specifically, Pri4R augments VLAs with a lightweight point track head that predicts 3D point tracks. By injecting VLA features into this head to jointly predict future 3D trajectories, the model learns to incorporate evolving scene geometry within its shared representation space, enabling more physically aware context for precise control. Due to its architectural simplicity, Pri4R is compatible with…
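
The training setup the abstract describes, an action objective paired with an auxiliary 3D point track objective that is only used during training, can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the L1 form of both losses, the `track_weight` value, and the function names `action_loss`, `track_loss`, and `training_loss` are hypothetical.

```python
from typing import Sequence

Point = Sequence[float]   # one 3D point (x, y, z)
Track = Sequence[Point]   # the same point over T future time steps


def action_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between predicted and target action vectors (assumed form)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def track_loss(pred: Sequence[Track], target: Sequence[Track]) -> float:
    """Mean per-coordinate L1 error over all tracked points and time steps (assumed form)."""
    total, count = 0.0, 0
    for pred_track, target_track in zip(pred, target):
        for p, t in zip(pred_track, target_track):
            total += sum(abs(pc - tc) for pc, tc in zip(p, t))
            count += len(t)
    return total / count


def training_loss(
    pred_action: Sequence[float],
    gt_action: Sequence[float],
    pred_tracks: Sequence[Track],
    gt_tracks: Sequence[Track],
    track_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """Action loss plus a weighted auxiliary point track loss.

    The ground-truth tracks are the privileged 4D signal: they are needed
    only here, at training time. At inference only the action prediction
    is used, so the track term adds no cost.
    """
    return action_loss(pred_action, gt_action) + track_weight * track_loss(
        pred_tracks, gt_tracks
    )


# Toy example: one 2-D action and one tracked point over two future steps.
example_loss = training_loss(
    pred_action=[1.0, 2.0],
    gt_action=[1.0, 3.0],
    pred_tracks=[[(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]],
    gt_tracks=[[(0.0, 0.0, 0.0), (1.0, 1.0, 4.0)]],
)
print(example_loss)  # 0.75
```

In this toy case the action term is 0.5 and the track term is 0.5, so the combined value is 0.5 + 0.5 × 0.5 = 0.75. The point of the sketch is only the structure: the track term shapes the shared representation during training and is dropped afterward.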

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Robotics and Sensor-Based Localization