World2Act: Latent Action Post-Training via Skill-Compositional World Models
An Dinh Vuong, Tuan Van Vo, Abdullah Sohail, Haoran Ding, Liang Ma, Xiaodan Liang, Anqing Duan, Ivan Laptev, Ian Reid

TL;DR
World2Act is a post-training framework for vision-language-action (VLA) policies that aligns actions with world-model latents through a contrastive objective instead of pixel-space supervision. It combines this with LLM-based skill decomposition and skill-compositional world models to improve the robustness and generalization of embodied agents.
Contribution
The paper presents a novel post-training approach that aligns actions with world model latents and introduces an LLM-based skill decomposition pipeline for flexible, temporally consistent skill execution.
Findings
Achieves state-of-the-art results on RoboCasa and LIBERO benchmarks.
Improves real-world embodied agent performance by 6.7%.
Enhances generalization and robustness of vision-language-action policies.
Abstract
World Models (WMs) have emerged as a promising approach for post-training Vision-Language-Action (VLA) policies to improve robustness and generalization under environmental changes. However, most WM-based post-training methods rely on pixel-space supervision, making policies sensitive to pixel-level artifacts and hallucination from imperfect WM rollouts. We introduce World2Act, a post-training framework that aligns VLA actions directly with WM video-dynamics latents using a contrastive matching objective, reducing dependence on pixels. Post-training performance is tied to rollout quality, yet current WMs struggle with arbitrary-length video generation as they are mostly trained on fixed-length clips while robotic execution durations vary widely. To address this, we propose an automatic LLM-based skill-decomposition pipeline that segments high-level instructions into low-level prompts.…
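The abstract describes a contrastive matching objective between VLA actions and WM video-dynamics latents but does not give its exact form. As a minimal sketch, assuming an InfoNCE-style loss over a batch of paired embeddings (the function names `l2_normalize` and `info_nce`, the temperature value, and the assumption that actions and latents are already projected into a shared embedding space are all hypothetical, not taken from the paper), the idea could look like this:

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (small epsilon avoids division by zero)."""
    norm = math.sqrt(sum(x * x for x in v)) + 1e-12
    return [x / norm for x in v]


def info_nce(action_embs, latent_embs, temperature=0.1):
    """InfoNCE-style loss: the i-th action embedding should match the i-th latent.

    Illustrative assumption only; the paper's actual objective may differ.
    action_embs, latent_embs: equal-length lists of equal-dimension vectors.
    """
    a = [l2_normalize(v) for v in action_embs]
    z = [l2_normalize(v) for v in latent_embs]
    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity of action i against every latent in the batch.
        logits = [sum(x * y for x, y in zip(a_i, z_j)) / temperature for z_j in z]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the correct (same-index) pairing.
        total += log_denom - logits[i]
    return total / len(a)


if __name__ == "__main__":
    actions = [[1.0, 0.0], [0.0, 1.0]]
    aligned = info_nce(actions, [[1.0, 0.0], [0.0, 1.0]])
    swapped = info_nce(actions, [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < swapped)
```

In this sketch the loss is low when each action embedding is closest to its own paired latent and high when pairings are mismatched. Supervision therefore acts on latent similarity, not on reconstructed pixels, which is the property the abstract emphasizes.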
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
