ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment

Yuzhi Chen; Ronghan Chen; Dongjie Huo; Yandan Yang; Dekang Qi; Haoyun Liu; Tong Lin; Shuang Zeng; Junjin Xiao; Xinyuan Chang; Feng Xiong; Xing Wei; Zhiheng Ma; and Mu Xu

arXiv:2603.23376 · cs.CV · March 30, 2026

TL;DR

ABot-PhysWorld is a 14B Diffusion Transformer that generates visually realistic, physically plausible, and action-controllable robotic manipulation videos. It is trained on a curated dataset of three million manipulation clips with physics-aware annotation and evaluated on PBench and the newly introduced EZSbench.

Contribution

The paper introduces a novel physics-aware training framework for a large diffusion model and a new benchmark for evaluating physical realism and action alignment in robotic videos.

Findings

01. Achieves state-of-the-art results on PBench and EZSbench benchmarks.

02. Surpasses previous models in physical plausibility and trajectory consistency.

03. Introduces a new zero-shot benchmark for embodied video generation evaluation.

Abstract

Video-based world models offer a powerful paradigm for embodied simulation and planning, yet state-of-the-art models often generate physically implausible manipulations, such as object penetration and anti-gravity motion, because they are trained on generic visual data with likelihood-based objectives that ignore physical laws. We present ABot-PhysWorld, a 14B Diffusion Transformer model that generates visually realistic, physically plausible, and action-controllable videos. Built on a curated dataset of three million manipulation clips with physics-aware annotation, it uses a novel DPO-based post-training framework with decoupled discriminators to suppress unphysical behaviors while preserving visual quality. A parallel context block enables precise spatial action injection for cross-embodiment control. To better evaluate generalization, we introduce EZSbench, the first training-independent…
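
The abstract names a DPO-based post-training stage, but this excerpt does not give its objective. As a rough illustration only, the sketch below shows the generic Direct Preference Optimization loss for a single preference pair, assuming log-likelihoods from a trainable policy and a frozen reference model. The function name `dpo_loss`, its parameters, and the default `beta` are hypothetical; the paper's decoupled discriminators and any diffusion-specific formulation are not modeled here.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic DPO loss for one (preferred w, rejected l) pair.

    Illustrative assumption: inputs are scalar log-likelihoods of the two
    samples under the policy (logp_*) and a frozen reference (ref_logp_*).
    This is not the paper's actual training objective.
    """
    # Margin: how much more the policy favors w over l than the reference does.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference agree, the margin is zero and the loss equals log 2; the loss falls as the policy raises the preferred sample relative to the rejected one, and rises in the opposite case.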

Code & Models

Repositories

amap-cvlab/ABot-PhysWorld (GitHub)
