World2Act: Latent Action Post-Training via Skill-Compositional World Models

An Dinh Vuong; Tuan Van Vo; Abdullah Sohail; Haoran Ding; Liang Ma; Xiaodan Liang; Anqing Duan; Ivan Laptev; Ian Reid

arXiv:2603.10422 · cs.CV · March 12, 2026

Open Access

TL;DR

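World2Act is a post-training framework for vision-language-action (VLA) policies that aligns actions with world-model latents through a contrastive objective, reducing reliance on pixel-space supervision. It combines this with LLM-based skill decomposition and a skill-compositional world model to improve the robustness and generalization of embodied agents.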

Contribution

The paper presents a post-training approach that aligns VLA actions with world-model latents, together with an LLM-based skill-decomposition pipeline for flexible, temporally consistent skill execution.

Findings

01. Achieves state-of-the-art results on RoboCasa and LIBERO benchmarks.

02. Improves real-world embodied agent performance by 6.7%.

03. Enhances generalization and robustness of vision-language-action policies.

Abstract

World Models (WMs) have emerged as a promising approach for post-training Vision-Language-Action (VLA) policies to improve robustness and generalization under environmental changes. However, most WM-based post-training methods rely on pixel-space supervision, making policies sensitive to pixel-level artifacts and hallucination from imperfect WM rollouts. We introduce World2Act, a post-training framework that aligns VLA actions directly with WM video-dynamics latents using a contrastive matching objective, reducing dependence on pixels. Post-training performance is tied to rollout quality, yet current WMs struggle with arbitrary-length video generation as they are mostly trained on fixed-length clips while robotic execution durations vary widely. To address this, we propose an automatic LLM-based skill-decomposition pipeline that segments high-level instructions into low-level prompts.…
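To make the idea of a contrastive matching objective between actions and world-model latents concrete, here is a minimal sketch. The excerpt above does not give the loss itself, so everything here is an assumption for illustration only: the symmetric InfoNCE form, the function name `info_nce`, the temperature value, and the embedding shapes. It should not be read as the paper's actual objective.

```python
import numpy as np


def info_nce(action_emb, latent_emb, temperature=0.1):
    """Symmetric InfoNCE loss over paired embeddings (illustrative assumption).

    action_emb, latent_emb: arrays of shape (batch, dim). Row i of each array
    is treated as a matched (action, world-model latent) pair; all other rows
    in the batch act as negatives.
    """
    # L2-normalise so that dot products are cosine similarities.
    a = action_emb / np.linalg.norm(action_emb, axis=1, keepdims=True)
    z = latent_emb / np.linalg.norm(latent_emb, axis=1, keepdims=True)
    logits = a @ z.T / temperature  # shape (batch, batch)

    def diag_cross_entropy(scores):
        # Numerically stable log-softmax over each row, then pick the
        # diagonal entries, which correspond to the matched pairs.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the action-to-latent and latent-to-action directions.
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))


# Toy usage with random stand-ins for real embeddings (hypothetical data).
rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 16))
aligned = info_nce(latents, latents)                          # matched pairs
mismatched = info_nce(np.roll(latents, 1, axis=0), latents)   # shifted pairs
```

In this toy setup the loss is lower when each action embedding is paired with its own latent than when the pairing is shifted by one row, which is the behaviour a contrastive alignment objective is meant to reward. Because the loss operates only on embeddings, no pixel-space reconstruction term appears anywhere in it.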

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis