Vidar: Embodied Video Diffusion Model for Generalist Manipulation
Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, Jun Zhu

TL;DR
Vidar introduces a scalable embodied video diffusion model that, with minimal demonstrations, generalizes manipulation skills across diverse robot platforms and environments, leveraging large-scale pre-training and a novel masking approach.
Contribution
The paper presents Vidar, a novel embodied video diffusion model combined with a masked inverse dynamics module, enabling generalist robot manipulation with minimal data and broad generalization.
Findings
Outperforms state-of-the-art baselines with only 20 minutes of demonstrations.
Successfully generalizes to unseen tasks, backgrounds, and camera layouts.
Leverages large-scale pre-training and a unified observation space for embodiment adaptation.
Abstract
Scaling general-purpose manipulation to new robot embodiments remains challenging: each platform typically needs large, homogeneous demonstrations, and end-to-end pixel-to-action pipelines may degenerate under background and viewpoint shifts. Building on previous advances in video-based robot control, we present Vidar, consisting of an embodied video diffusion model as the generalizable prior and a masked inverse dynamics model (MIDM) as the adapter. We leverage a video diffusion model pre-trained at Internet scale and continue pre-training it for the embodied domain using 750K multi-view trajectories collected from three real-world robot platforms. For this embodied pre-training, we introduce a unified observation space that jointly encodes robot, camera, task, and scene contexts. The MIDM module learns action-relevant pixel masks without dense labels, grounding the prior into…
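To make the masked inverse dynamics idea more concrete, here is a minimal toy sketch in Python with NumPy. It is not the authors' architecture. The per-pixel sigmoid mask, the linear action head, the L1-style sparsity penalty, the 8x8 frame, the 14-dimensional action, and all names (`ToyMaskedIDM`, `midm_loss`, `sparsity_weight`) are assumptions chosen for illustration. It only shows how a soft mask can be learned from an action loss alone, without mask labels.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class ToyMaskedIDM:
    """Toy masked inverse dynamics model: frame -> soft mask -> action."""

    def __init__(self, height, width, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Hypothetical per-pixel mask parameters; a real model would use a network.
        self.mask_scale = rng.normal(size=(height, width))
        self.mask_bias = np.zeros((height, width))
        # Hypothetical linear action head over the masked, flattened frame.
        self.head = rng.normal(scale=0.01, size=(action_dim, height * width))

    def forward(self, frame):
        # Soft mask with values in (0, 1), predicted from the frame itself.
        mask = sigmoid(self.mask_scale * frame + self.mask_bias)
        # The action head only sees the masked frame.
        masked_frame = mask * frame
        action = self.head @ masked_frame.reshape(-1)
        return mask, action


def midm_loss(pred_action, true_action, mask, sparsity_weight=1e-3):
    """Action regression error plus an assumed sparsity penalty on the mask."""
    action_error = np.mean((pred_action - true_action) ** 2)
    sparsity = np.mean(np.abs(mask))
    return action_error + sparsity_weight * sparsity


# Forward pass on a random grayscale frame (placeholder data, not robot video).
frame = np.random.default_rng(1).random((8, 8))
model = ToyMaskedIDM(height=8, width=8, action_dim=14)
mask, action = model.forward(frame)
loss = midm_loss(action, np.zeros(14), mask)
```

In this sketch the only supervision is the action target. The sparsity term is one assumed way to push the mask toward a small set of pixels, which is the intuition behind masks that need no dense labels.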
Peer Reviews
Decision: Submitted to ICLR 2026
- The authors raise an important problem of data scarcity and cross-embodiment adaptation in generalist manipulation, and propose a highly effective prior-driven, low-shot adaptation paradigm to tackle it.
- Moreover, the authors introduce the Masked Inverse Dynamics Model (MIDM), a lightweight and novel architectural component that implicitly learns spatial masks to focus on action-relevant regions, making the action decoding highly robust to visual distractors.
- The proposed system demonstr
- The open-loop video generation process is currently computationally expensive and slow, requiring approximately 25 seconds on high-end GPUs, which is a significant barrier for real-time deployment.
- The crucial performance improvement from Test-Time Scaling (TTS) relies on an external, large Vision-Language Model like GPT-4o for "physics-aware reranking." This dependency introduces an opaque, expensive, and potentially brittle component into the core control system.
- Confusion in Presentat
- Pretraining on 750K multi-view trajectories from three robotic platforms is useful to learn a generalizable prior for manipulation tasks.
- Using masks as an intermediate representation for predicting actions encourages the model to focus on embodiment-relevant features, as evident in Fig. 3.
- Experiments on the RoboTwin benchmark (Tab. 1) show benefits over existing baselines like VPP and UniPi on both seen and unseen settings.
- The text has several mentions of pretraining/learning priors from internet videos (L017-018, L060-061, L062-063, L140-141). However, Fig. 1 and the training details in Sec. 3.1.2 do not mention any details about pretraining on internet videos. L062-063 explicitly states: "we propose a three-stage training pipeline, where Internet-scale videos are used for general pretraining". However, pretraining details on internet videos are not clear from the paper.
- How does the unified observation space (L
The paper is relatively well written and easy to follow. The proposed approach is sound. The masked inverse dynamics model has some non-trivial novelty. Strong results are reported in a real-world evaluation in a very low data regime. A minimal ablation study is reported.
The positioning of the paper is inaccurate in several ways:
1. The abstract opens with this statement: “pixel-to-action VLA pipelines typically degenerate under background and view-point shifts”, which is simply incorrect for the latest pixel-to-action pipelines (as demonstrated later in the paper, where the proposed approach performs worse than the pixel-to-action Pi0 out of domain on the RoboTwin benchmark).
2. While the authors do discuss existing works on video generation + robot control in rel
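The test-time scaling with reranking mentioned in the reviews can be read as a best-of-N selection: sample several candidate rollouts, score each one, and keep the highest-scoring candidate. The sketch below illustrates that pattern only. The function `best_of_n`, the toy `smoothness_score`, and the use of short lists of numbers in place of generated videos are hypothetical stand-ins. In the paper the scoring is reportedly done by a vision-language model, which is not reproduced here.

```python
from typing import Callable, List, Sequence, Tuple


def best_of_n(
    candidates: Sequence[List[float]],
    scorer: Callable[[List[float]], float],
) -> Tuple[List[float], float]:
    """Return the candidate with the highest score, together with that score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scored = [(scorer(candidate), index) for index, candidate in enumerate(candidates)]
    # max() keeps the first candidate when scores tie.
    best_score, best_index = max(scored, key=lambda pair: pair[0])
    return candidates[best_index], best_score


def smoothness_score(rollout: List[float]) -> float:
    """Hypothetical scorer: penalize large jumps between consecutive values."""
    return -sum(abs(b - a) for a, b in zip(rollout, rollout[1:]))


# Three placeholder "rollouts"; a real system would hold generated videos here.
candidates = [[0.0, 5.0, 0.0], [0.0, 1.0, 2.0], [0.0, 3.0, 1.0]]
best, score = best_of_n(candidates, smoothness_score)
print(best, score)  # prints [0.0, 1.0, 2.0] -2.0
```

Passing the scorer as a plain callable keeps the selection logic independent of whatever model does the ranking, which is the component the reviewers flag as opaque and costly.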
Taxonomy
Topics: Model Reduction and Neural Networks · Neural Networks and Applications
