Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, Jinwei Gu

TL;DR
Cosmos Policy introduces a simple, single-stage method to adapt pretrained video models into effective robot control policies, enabling high success rates in simulation and real-world tasks without architectural changes.
Contribution
It presents a novel approach that fine-tunes a large pretrained video model for robot policy learning through one post-training step, avoiding complex multi-stage processes and architectural modifications.
Findings
Achieves state-of-the-art success rates on LIBERO and RoboCasa benchmarks.
Outperforms existing diffusion and vision-language models in manipulation tasks.
Enables test-time planning over sampled action proposals, favoring those with a higher predicted likelihood of success.
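The test-time planning finding can be read as a best-of-N search: sample several action proposals, predict the resulting state and its value for each, and execute the highest-valued one. The sketch below is a minimal illustration of that pattern, not the authors' implementation. The function `plan_best_of_n` and its three callables (`sample_action`, `predict_next_state`, `predict_value`) are hypothetical stand-ins for the policy, world model, and value function.

```python
import numpy as np

def plan_best_of_n(sample_action, predict_next_state, predict_value, obs, n=8, rng=None):
    """Best-of-N planning sketch; all three callables are hypothetical stand-ins."""
    rng = np.random.default_rng() if rng is None else rng
    best_action, best_value = None, -np.inf
    for _ in range(n):
        action = sample_action(obs, rng)              # policy: propose an action
        next_state = predict_next_state(obs, action)  # world model: imagine the outcome
        value = float(predict_value(next_state))      # value function: score the outcome
        if value > best_value:
            best_action, best_value = action, value
    return best_action, best_value

if __name__ == "__main__":
    # Toy problem (assumed for illustration): reach a goal point in 2D.
    goal = np.array([1.0, 1.0])
    action, value = plan_best_of_n(
        sample_action=lambda obs, rng: rng.normal(size=2),
        predict_next_state=lambda obs, a: obs + a,
        predict_value=lambda s: -np.linalg.norm(s - goal),
        obs=np.zeros(2),
        n=16,
        rng=np.random.default_rng(0),
    )
    print(action, value)
```

Increasing `n` trades extra inference cost for a better chance that one proposal scores well under the value estimate.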
Abstract
Recent video generation models demonstrate remarkable ability to capture complex physical interactions and scene evolution over time. To leverage their spatiotemporal priors, robotics works have adapted video models for policy learning but introduce complexity by requiring multiple stages of post-training and new architectural components for action generation. In this work, we introduce Cosmos Policy, a simple approach for adapting a large pretrained video model (Cosmos-Predict2) into an effective robot policy through a single stage of post-training on the robot demonstration data collected on the target platform, with no architectural modifications. Cosmos Policy learns to directly generate robot actions encoded as latent frames within the video model's latent diffusion process, harnessing the model's pretrained priors and core learning algorithm to capture complex action…
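To make "actions encoded as latent frames" concrete, here is one plausible encoding, stated as an assumption because the excerpt above does not spell out the scheme: flatten a normalized action chunk, tile it until it fills a tensor with the shape of one latent frame, and decode by averaging the copies. The function names and the shapes used are hypothetical.

```python
import numpy as np

def encode_actions_as_latent(actions, latent_shape):
    """Tile a flattened action chunk to fill one latent frame (assumed scheme)."""
    flat = np.asarray(actions, dtype=np.float32).ravel()
    size = int(np.prod(latent_shape))
    if flat.size > size:
        raise ValueError("action chunk does not fit in one latent frame")
    reps = -(-size // flat.size)  # ceiling division
    return np.tile(flat, reps)[:size].reshape(latent_shape)

def decode_actions_from_latent(latent, action_shape):
    """Recover the action chunk by averaging every complete copy in the frame."""
    n = int(np.prod(action_shape))
    flat = np.asarray(latent, dtype=np.float32).ravel()
    full = flat.size // n  # number of complete copies stored in the frame
    return flat[: full * n].reshape(full, n).mean(axis=0).reshape(action_shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    chunk = rng.uniform(-1.0, 1.0, size=(16, 7))  # 16 steps x 7 action dims (assumed)
    latent = encode_actions_as_latent(chunk, latent_shape=(16, 8, 8))
    recovered = decode_actions_from_latent(latent, chunk.shape)
    print(np.abs(recovered - chunk).max())
```

Averaging redundant copies is a design choice of this sketch: it would damp independent noise left in the generated latent, at the cost of using the whole frame for a small payload.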
Peer Reviews
Decision: ICLR 2026 Poster
I like the idea of repurposing a video generation model (one that has already learned spatio-temporal prediction) for other spatio-temporal prediction tasks, in this case robot action/value data. The evidence provided by the paper that this simple idea works is noteworthy beyond just the numbers. The numbers themselves are impressive: fine-tuning the 1.2B-parameter Cosmos-Predict2 model on just a few hundred robot demonstrations yields 98.5% task success on the LIBERO benchmark, outperforming
1. It would have been nice to see specific numbers on how well the state, actions, and value functions are predicted using this model.
2. It is also not clear how well this would generalize to standard tasks (that may be in the pretraining data of the underlying model, even though I fully agree that the action data is not). Seeing more generalizability experiments would have been nice.
3. It was not clear how well this works for longer-horizon tasks.
- Strong empirical performance across the evaluation benchmarks, even against existing methods that already rely on video generative models.
- The proposed idea is simple yet effective, and enables joint training of the policy, world model, and value function within the same design.
- Sufficient analyses and ablation results (e.g., w/o auxiliary losses, Q_sa and V_s variants), which show interesting and significant results.
- I understand that the additional modalities are encoded as additional latents, but I still can't understand exactly how. I can understand from Figure 1 that the different modalities are interleaved, and the current state frames are given as conditioning inputs - but it is hard to understand where the original latents remain, and where the additional modalities are input.
- Without the pretrained model, the performance of Cosmos Policy falls below that of CogVLA. What happens if the same post
- The main strength of this work is in using the video generation model to output actions with only a single fine-tuning stage and without any architectural changes to the model. This is in contrast to other approaches that employ architectural changes, like adding an inverse dynamics model, or that use multiple stages of post-training to generate actions.
- The proposed approach enables model-based planning where multiple action proposals can be sampled from the policy and resulting states / values can b
- It's unclear why making architectural changes to the video generation model is seen as a weakness in other approaches.
- Are the authors assuming the world model is on the state s (in Sec. 3)? The authors claim the world model predicts the state. World models predict observations, not states. Clearly making this distinction is important, as the state is not completely observable. The method in the paper predicts the value function as a function of the state. However, since we are only predi
Taxonomy
Topics: Robot Manipulation and Learning · Reinforcement Learning in Robotics · Generative Adversarial Networks and Image Synthesis
