ChronoDreamer: Action-Conditioned World Model as an Online Simulator for Robotic Planning
Zhenhao Zhou, Dan Negrut

TL;DR
ChronoDreamer is an action-conditioned world model that predicts future states in contact-rich robotic manipulation; a vision-language model then scores the predicted rollouts for collision likelihood, enabling safe planning.
Contribution
It introduces a spatial-temporal transformer world model with a Gaussian-splat contact encoding, paired with a vision-language-model (VLM) collision evaluator for robotic planning.
Findings
Accurately predicts contact-rich interactions in simulation.
Distinguishes safe from unsafe trajectories.
Preserves spatial coherence during motion.
Abstract
We present ChronoDreamer, an action-conditioned world model for contact-rich robotic manipulation. Given a history of egocentric RGB frames, contact maps, actions, and joint states, ChronoDreamer predicts future video frames, contact distributions, and joint angles via a spatial-temporal transformer trained with MaskGIT-style masked prediction. Contact is encoded as depth-weighted Gaussian splat images that render 3D forces into a camera-aligned format suitable for vision backbones. At inference, predicted rollouts are evaluated by a vision-language model that reasons about collision likelihood, enabling rejection sampling of unsafe actions before execution. We train and evaluate on DreamerBench, a simulation dataset generated with Project Chrono that provides synchronized RGB, contact splat, proprioception, and physics annotations across rigid and deformable object scenarios.…
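The abstract says contact is encoded as depth-weighted Gaussian splat images that render 3D forces into a camera-aligned format. The sketch below shows one way such an image could be produced. It assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, contacts already expressed in the camera frame, a fixed splat width `sigma_px`, and a weighting of force divided by depth. The function name `render_contact_splat` and the force/depth weighting are hypothetical illustrations, not the paper's formulation.

```python
import math
from typing import List, Sequence, Tuple

# A contact: (x, y, z) position in the camera frame (metres, z > 0 means in
# front of the camera) plus a scalar force magnitude (newtons).
Contact = Tuple[float, float, float, float]


def render_contact_splat(
    contacts: Sequence[Contact],
    width: int,
    height: int,
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    sigma_px: float = 2.0,
) -> List[List[float]]:
    """Render contacts as a single-channel image of depth-weighted Gaussians.

    Hypothetical weighting: amplitude = force / depth, so nearer contacts of
    equal force appear brighter. The paper's exact weighting may differ.
    """
    image = [[0.0] * width for _ in range(height)]
    radius = int(math.ceil(3 * sigma_px))  # truncate each Gaussian at 3 sigma
    for x, y, z, force in contacts:
        if z <= 0:
            continue  # behind the camera, so not visible in this view
        # Pinhole projection of the contact point into pixel coordinates.
        u = fx * x / z + cx
        v = fy * y / z + cy
        amplitude = force / z
        u0, v0 = int(round(u)), int(round(v))
        # Accumulate the Gaussian only inside the truncated window.
        for row in range(max(0, v0 - radius), min(height, v0 + radius + 1)):
            for col in range(max(0, u0 - radius), min(width, u0 + radius + 1)):
                d2 = (col - u) ** 2 + (row - v) ** 2
                image[row][col] += amplitude * math.exp(-d2 / (2 * sigma_px ** 2))
    return image
```

With this weighting, two contacts of equal force at depths 1 m and 2 m peak at intensities in a 2:1 ratio, which is what makes the image "depth-weighted"; because the output is an ordinary image aligned with the RGB camera, it can be fed to the same vision backbone.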
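ChronoDreamer is trained with MaskGIT-style masked prediction. For readers unfamiliar with MaskGIT, the sketch below shows its standard iterative decoding loop with the cosine mask schedule: each step predicts every masked token, keeps the most confident predictions, and re-masks the rest so that the masked fraction follows cos(π/2 · t/T). The `predict` callable is a hypothetical stand-in for the spatial-temporal transformer; how ChronoDreamer tokenizes frames, contacts, and joint angles is not specified here, and the confidence noise MaskGIT adds during sampling is omitted.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical predictor interface: given the current (partially masked) token
# sequence, return a (token_id, confidence) pair for every position.
Predictor = Callable[[Sequence[Optional[int]]], List[Tuple[int, float]]]


def cosine_mask_ratio(progress: float) -> float:
    """Fraction of tokens that stay masked at decoding progress in [0, 1]."""
    return math.cos(math.pi / 2 * progress)


def maskgit_decode(predict: Predictor, num_tokens: int, steps: int) -> List[int]:
    """Unmask `num_tokens` tokens in `steps` parallel refinement steps."""
    tokens: List[Optional[int]] = [None] * num_tokens  # None marks a masked slot
    for t in range(steps):
        preds = predict(tokens)
        masked = [i for i, tok in enumerate(tokens) if tok is None]
        # Tentatively fill every masked slot with its predicted token.
        for i in masked:
            tokens[i] = preds[i][0]
        # Number of slots that should still be masked after this step.
        n_remask = math.floor(cosine_mask_ratio((t + 1) / steps) * num_tokens)
        # Re-mask the least confident of the newly filled slots.
        masked.sort(key=lambda i: preds[i][1])
        for i in masked[:n_remask]:
            tokens[i] = None
    # The schedule reaches 0 at the final step, so nothing should remain masked.
    if any(tok is None for tok in tokens):
        raise RuntimeError("decoding schedule left tokens masked")
    return [tok for tok in tokens if tok is not None]
```

Decoding many tokens per step is what makes this cheaper than token-by-token autoregression: 16 tokens over 8 steps needs only 8 forward passes, with the masked count shrinking 16, 15, 14, 13, 11, 8, 6, 3 before the final pass fills the rest.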
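At inference, the abstract describes evaluating predicted rollouts with a vision-language model and rejecting unsafe actions before execution. A minimal sketch of that rejection-sampling loop follows, assuming three hypothetical callables: `sample_action` proposes a candidate action, `rollout` stands in for the world model's imagined future, and `collision_prob` stands in for the VLM's collision-likelihood judgment as a number in [0, 1]. The fixed `threshold`, the sample budget, and returning the first accepted candidate are illustrative choices, not details from the paper.

```python
import random
from typing import Callable, List, Optional

Action = List[float]


def plan_with_rejection(
    sample_action: Callable[[random.Random], Action],
    rollout: Callable[[Action], object],
    collision_prob: Callable[[object], float],
    threshold: float = 0.2,
    max_samples: int = 32,
    seed: int = 0,
) -> Optional[Action]:
    """Return the first sampled action whose predicted rollout is judged safe.

    All three callables are hypothetical stand-ins: `rollout` for the world
    model's imagined future and `collision_prob` for a VLM collision score.
    """
    rng = random.Random(seed)
    for _ in range(max_samples):
        action = sample_action(rng)
        predicted = rollout(action)  # imagine the outcome; nothing is executed yet
        if collision_prob(predicted) < threshold:
            return action  # accepted: judged safe enough to send to the robot
    return None  # every candidate was rejected; the caller needs a fallback


# Toy usage: a 1-D action, an identity "world model", and an obstacle
# occupying x > 0 that the evaluator flags with certainty.
safe = plan_with_rejection(
    sample_action=lambda rng: [rng.uniform(-1.0, 1.0)],
    rollout=lambda a: a,
    collision_prob=lambda pred: 1.0 if pred[0] > 0 else 0.0,
)
print(safe)
```

Returning the first accepted sample keeps the number of VLM queries low; an alternative under the same assumptions is to score a fixed batch of candidates and execute the one with the lowest predicted collision probability.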
Taxonomy
Topics: Robot Manipulation and Learning · Social Robot Interaction and HRI · Human Pose and Action Recognition
