BridgeV2W: Bridging Video Generation Models to Embodied World Models via Embodiment Masks
Yixiang Chen, Peiyan Li, Jiabing Yang, Keji He, Xiangnan Wu, Yuan Xu, Kai Wang, Jing Liu, Nianfeng Liu, Yan Huang, Liang Wang

TL;DR
BridgeV2W introduces a method to align coordinate-space actions with pixel-space videos using embodiment masks and a ControlNet-style pathway, improving video generation and enabling downstream robotics tasks across diverse embodiments.
Contribution
The paper presents BridgeV2W, a novel approach that converts coordinate actions into embodiment masks and integrates them into pretrained video models for unified, view-aware embodied world modeling.
Findings
Improves video generation quality over state-of-the-art methods.
Adds view-specific conditioning that accommodates diverse camera viewpoints.
Demonstrates effectiveness in downstream robotics tasks like policy evaluation.
Abstract
Embodied world models have emerged as a promising paradigm in robotics, most of which leverage large-scale Internet videos or pretrained video generation models to enrich visual and motion priors. However, they still face key challenges: a misalignment between coordinate-space actions and pixel-space videos, sensitivity to camera viewpoint, and non-unified architectures across embodiments. To this end, we present BridgeV2W, which converts coordinate-space actions into pixel-aligned embodiment masks rendered from the URDF and camera parameters. These masks are then injected into a pretrained video generation model via a ControlNet-style pathway, which aligns the action control signals with predicted videos, adds view-specific conditioning to accommodate camera viewpoints, and yields a unified world model architecture across embodiments. To mitigate overfitting to static backgrounds,…
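The step of turning coordinate-space geometry into a pixel-aligned mask can be illustrated with a plain pinhole projection. This is a minimal sketch, not the authors' renderer: it assumes the robot surface has already been sampled as 3D points in the world frame (in practice these would come from forward kinematics over the URDF link meshes), and the names `project_point` and `render_mask`, the `radius` splat parameter, and the world-to-camera convention for `rotation` and `translation` are all hypothetical.

```python
def project_point(point_world, rotation, translation, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates with a pinhole camera.

    rotation (3x3 nested lists) and translation (length 3) are assumed to map
    world coordinates into the camera frame. Returns (u, v), or None when the
    point lies at or behind the camera plane.
    """
    x, y, z = (
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if z <= 0:
        return None
    return fx * x / z + cx, fy * y / z + cy


def render_mask(points_world, rotation, translation, fx, fy, cx, cy,
                height, width, radius=1):
    """Rasterize projected points into a binary mask (list of rows).

    Each visible point is splatted as a small square of side 2 * radius + 1.
    A real renderer would rasterize mesh triangles with occlusion handling.
    """
    mask = [[0] * width for _ in range(height)]
    for point in points_world:
        uv = project_point(point, rotation, translation, fx, fy, cx, cy)
        if uv is None:
            continue  # skip points behind the camera
        u, v = round(uv[0]), round(uv[1])
        for row in range(v - radius, v + radius + 1):
            for col in range(u - radius, u + radius + 1):
                if 0 <= row < height and 0 <= col < width:
                    mask[row][col] = 1
    return mask
```

Because the mask depends on both the robot pose and the camera parameters, the same action yields a different mask for each viewpoint, which is what makes the conditioning view-specific.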
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
**S1:** Clear motivation and observation: large-scale pretrained video generation models suffer from three key limitations, and if the action representation is transformed into a pixel-aligned mask that reflects the embodiment's actual motion, these limitations can be substantially mitigated. The URDF and the camera intrinsics and extrinsics provide a solid way to tackle this.
**S2:** Motion-centric training objective: the paper adds a flow-based motion loss that emphasizes dynamic task-relevan reg
**W1:** Mask-IoU evaluates alignment between segments of generated and ground-truth frames. But because BridgeV2W is conditioned on URDF-rendered masks, the metric remains highly correlated with the conditioning signal and may not accurately model motion or contact.
**W2:** The experiments labeled as "unseen camera viewpoint" give the method ground-truth camera intrinsics/extrinsics at test time and use the URDF to project a per-view robot mask that is injected into the video generator. This s
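For readers unfamiliar with the metric discussed in W1, the following is a minimal sketch of intersection-over-union between two binary masks. How the paper segments the generated and ground-truth frames is not stated in this excerpt, so the sketch assumes both masks are already available as equally sized grids of 0/1 values; the function name `mask_iou` and the choice to return 1.0 when both masks are empty are assumptions.

```python
def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two equally sized binary masks (lists of rows)."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                intersection += 1
            if a or b:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0
```

The reviewer's concern follows directly from this definition: if the generated robot pixels closely track the conditioning mask, the score can be high regardless of whether object motion or contact is predicted correctly.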
1. High originality: Rather than treating robot actions as abstract coordinate vectors (e.g., end-effector poses), the authors propose rendering them as pixel-aligned embodiment masks using readily available URDF models and camera parameters. This insight effectively reconciles the semantic and representational mismatch between low-dimensional control signals and high-dimensional video generation models.
2. Rigorous and thorough: Experiments span two diverse robotic platforms (single-arm DROID
1. Dependence on Precise Camera Calibration and URDF. The core embodiment mask generation pipeline assumes access to accurate camera intrinsics/extrinsics and a complete URDF model. While common in controlled lab settings (e.g., DROID, AgiBot-G1), this requirement severely limits applicability in real-world or human-in-the-loop scenarios where:
   - Camera calibration may drift or be unavailable (e.g., mobile phones, uncalibrated webcams),
   - URDFs may be missing (e.g., legacy industrial arms, so
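The calibration concern can be made concrete with a back-of-the-envelope estimate of how far a rendered mask shifts in the image when the extrinsics are slightly wrong. This is a rough sketch under a pinhole model for a point near the optical axis, not an analysis from the paper; the function names and all numeric values below are illustrative assumptions.

```python
import math


def pixel_shift_from_translation_error(focal_px, depth_m, error_m):
    """Approximate image shift (pixels) caused by a lateral camera translation error.

    A point at depth Z displaced sideways by delta moves by f * delta / Z pixels.
    """
    return focal_px * error_m / depth_m


def pixel_shift_from_rotation_error(focal_px, error_deg):
    """Approximate image shift (pixels) near the image center caused by a small
    camera rotation error about an axis perpendicular to the optical axis."""
    return focal_px * math.tan(math.radians(error_deg))


# Hypothetical numbers: 600 px focal length, robot 1 m away.
shift_t = pixel_shift_from_translation_error(600, 1.0, 0.01)  # 1 cm error
shift_r = pixel_shift_from_rotation_error(600, 1.0)           # 1 degree error
print(f"translation: {shift_t:.1f} px, rotation: {shift_r:.1f} px")
```

Under these assumed values, a 1 cm translation error moves the mask by about 6 px and a 1 degree rotation error by about 10.5 px, which is on the scale of a gripper finger in a low-resolution frame.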
1. The embodiment mask design elegantly bridges the gap between coordinate-space actions and pixel-space video prediction.
2. Consistent improvements across PSNR, SSIM, LPIPS, and especially FVD and Mask-IoU metrics on both datasets. Notable robustness in unseen-view and unseen-scene settings (Table 1).
3. The introduced flow-based motion loss is interesting, as it encourages learning from dynamic, task-relevant regions.
4. Demonstrates practical use for real-world policy evaluation and goal-con
1. The approach assumes access to precise URDFs and camera parameters, which may not hold for in-the-wild or human video data (although segmentation-based alternatives are mentioned).
2. The goal-conditioned manipulation tasks show modest performance (13/40 successes vs. 17/40 from VLA baselines), indicating that planning still struggles with complex motion or rotation-heavy actions.
3. How sensitive is BridgeV2W to inaccurate URDFs or camera calibration errors? Would learned or self-calibrated
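The flow-based motion loss that the reviews highlight is not specified in this excerpt. One simple way to realize the stated idea of emphasizing dynamic regions over static background is to weight the per-pixel reconstruction error by optical-flow magnitude. The sketch below shows that generic weighting only; it is an assumption for illustration, not the paper's loss, and the name `motion_weighted_mse` and the `eps` stabilizer are hypothetical.

```python
def motion_weighted_mse(pred, target, flow_magnitude, eps=1e-8):
    """Mean squared error with each pixel weighted by its flow magnitude.

    pred, target, and flow_magnitude are equally sized grids (lists of rows).
    Pixels with zero flow (static background) contribute nothing to the loss.
    """
    weighted_error = 0.0
    total_weight = 0.0
    for pred_row, target_row, flow_row in zip(pred, target, flow_magnitude):
        for p, t, w in zip(pred_row, target_row, flow_row):
            weighted_error += w * (p - t) ** 2
            total_weight += w
    # eps avoids division by zero when the whole frame is static.
    return weighted_error / (total_weight + eps)
```

With this weighting, an error on a moving pixel is penalized while the same error on a static pixel is ignored, which is one way to discourage a model from spending capacity on reproducing unchanging backgrounds.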
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Multimodal Machine Learning Applications
