PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation
Wenlong Huang, Yu-Wei Chao, Arsalan Mousavian, Ming-Yu Liu, Dieter Fox, Kaichun Mo, Li Fei-Fei

TL;DR
PointWorld is a large-scale 3D world model trained on extensive robotic manipulation data, capable of predicting 3D responses to actions from minimal visual input, enabling versatile real-world robotic manipulation without demonstrations.
Contribution
We introduce PointWorld, a novel 3D world model that unifies state and action as 3D point flows, trained on a large dataset, and demonstrates real-time manipulation capabilities in the wild.
Findings
Real-time inference at 0.1s per prediction.
Successful manipulation tasks without demonstrations.
Effective generalization across different robot embodiments.
Abstract
Humans anticipate, from a glance and a contemplated action of their bodies, how the 3D world will respond, a capability that is equally vital for robotic manipulation. We introduce PointWorld, a large pre-trained 3D world model that unifies state and action in a shared 3D space as 3D point flows: given one or few RGB-D images and a sequence of low-level robot action commands, PointWorld forecasts per-pixel displacements in 3D that respond to the given actions. By representing actions as 3D point flows instead of embodiment-specific action spaces (e.g., joint positions), this formulation directly conditions on physical geometries of robots while seamlessly integrating learning across embodiments. To train our 3D world model, we curate a large-scale dataset spanning real and simulated robotic manipulation in open-world environments, enabled by recent advances in 3D vision and simulated…
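The abstract's shared representation can be illustrated with a minimal sketch: scene pixels are lifted to 3D from a depth image, and a robot action sequence is expressed as per-step displacements of points on the robot. Everything named here is an assumption made for illustration (the pinhole intrinsics `fx, fy, cx, cy`, the simplified pose of yaw plus translation, and the function names); none of it is taken from the paper, and the learned model that maps these inputs to predicted scene flow is not shown.

```python
import math

def backproject(depth, fx, fy, cx, cy):
    """Lift a depth image (rows of depths in metres) to one 3D point per pixel.

    Uses the standard pinhole model; the intrinsics are hypothetical inputs.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

def apply_pose(point, pose):
    """Apply a simplified rigid pose: rotate by yaw about z, then translate."""
    yaw, (tx, ty, tz) = pose
    x, y, z = point
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x - s * y + tx, s * x + c * y + ty, z + tz)

def action_point_flow(robot_points, poses):
    """Express a pose sequence as 3D point flows.

    flows[t][i] is the displacement of robot point i between step t and t + 1.
    """
    tracks = [[apply_pose(p, pose) for p in robot_points] for pose in poses]
    return [
        [tuple(b - a for a, b in zip(p0, p1)) for p0, p1 in zip(prev, nxt)]
        for prev, nxt in zip(tracks, tracks[1:])
    ]

# Toy inputs: a 2x2 depth image and two points sampled on a gripper.
scene = backproject([[1.0, 1.0], [2.0, 2.0]], fx=1.0, fy=1.0, cx=0.5, cy=0.5)
gripper = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
poses = [
    (0.0, (0.0, 0.0, 0.0)),           # start
    (0.0, (0.0, 0.0, 0.05)),          # move 5 cm along z
    (math.pi / 2, (0.0, 0.0, 0.05)),  # rotate 90 degrees about z
]
flows = action_point_flow(gripper, poses)
```

Because the action is described by where robot points move rather than by joint values, the same flow format applies to any embodiment whose geometry and kinematics are known, which is the property the abstract emphasizes.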
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. PointWorld establishes a unified and scalable framework for 3D world modeling in robotic manipulation. 2. Introduces a large 3D dynamics dataset (≈2M trajectories, 500 h) combining real DROID data and BEHAVIOR-1K simulation with precise metric depth and correspondences. 3. Demonstrates that large-scale 3D world models can act as foundation models bridging perception, physics, and control. 4. Enables model-predictive planning on real hardware for rigid, deformable, articulated, and tool-use task
1. No comparison with other robotic foundation models on zero-shot generalization capability. 2. The pipeline for extracting robot actions cannot handle challenging embodiments, e.g., dexterous hands, mobile arms, and so on. 3. The world model relies on imagination, which might not be perfect in terms of action alignment and physical accuracy. 4. Relies on pre-existing 3D depth recovery, which might cause some errors. 5. This method only works with a static camera; for a dynamic camera, it cannot hand
- The paper proposes a world-modeling paradigm that represents both state and action in a shared point-cloud space, avoiding the limitations of low-dimensional physics simulators or voxel/mesh state encodings. Mapping RGB-D perception and robot kinematics to a joint point-flow prediction task yields a physically intuitive and highly extensible formulation. - The proposed metric-stereo depth estimation, automatic extrinsic calibration, and marker-free tracking pipeline address the long-standing i
- The paper proposes unifying states and actions within a 3D point-space representation, but it does not provide a clear physical or dynamical justification for this formulation. Equation (1) treats 3D point-flow prediction largely as a black-box function approximation, without discussing its theoretical advantages or stability properties. - The training objective combines Huber loss, motion weighting, and aleatoric uncertainty, yet the independent contribution and weighting strategy of each com
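For readers unfamiliar with the three loss ingredients this review names, the sketch below shows one common way a Huber penalty, a motion weight, and a predicted aleatoric log-scale can be combined per point. This is an illustrative assumption, not the paper's objective: the combination rule, the parameters `moving_weight`, `motion_eps`, and `delta`, and the function names are all hypothetical.

```python
import math

def huber(residual, delta=1.0):
    """Huber penalty: quadratic near zero, linear beyond delta."""
    a = abs(residual)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

def point_flow_loss(pred, target, log_scale,
                    moving_weight=5.0, motion_eps=1e-3, delta=1.0):
    """Weighted mean over points of exp(-s) * huber + s (assumed form).

    pred, target: lists of 3D displacement tuples, one per point.
    log_scale: one predicted log-uncertainty s per point (aleatoric term).
    Points whose target displacement exceeds motion_eps get moving_weight.
    """
    total, weight_sum = 0.0, 0.0
    for p, t, s in zip(pred, target, log_scale):
        robust = sum(huber(pi - ti, delta) for pi, ti in zip(p, t))
        is_moving = math.sqrt(sum(ti * ti for ti in t)) > motion_eps
        w = moving_weight if is_moving else 1.0
        # A large s down-weights the residual but is itself penalized.
        total += w * (math.exp(-s) * robust + s)
        weight_sum += w
    return total / weight_sum

loss = point_flow_loss(
    pred=[(0.1, 0.0, 0.0), (0.0, 0.0, 0.04)],
    target=[(0.0, 0.0, 0.0), (0.0, 0.0, 0.05)],
    log_scale=[0.0, 0.0],
)
```

An ablation of the kind the reviewer asks for would vary one ingredient at a time, for example setting `moving_weight=1.0` or fixing every `log_scale` entry to zero, and comparing prediction error.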
- The paper conducts a rigorous and systematic empirical study on design choices (backbone, loss function, features, and scaling). - The careful curation and open-sourcing of the large-scale 3D dynamics dataset is a substantial, high-quality technical contribution. - The paper is well-structured and the key components of the method and experiments are clearly explained. The use of tables and figures (e.g., the scaling roadmap in Figure 4 and the backbone comparisons in Table 1) effectively conve
- **Lack of Clear Motivation and Comparative Advantage**: The paper does not sufficiently articulate the clear advantage of this specific method compared to other large-scale dynamics or control approaches. A clearer motivation is needed to explicitly demonstrate why this 3D point-flow formulation is superior to established alternatives. - Lack of Downstream Control Baselines: While the paper provides architectural ablations for the predictive task (Table 1), the crucial zero-shot model-based pl
Taxonomy
Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Motor Control and Adaptation
