Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training

Jiaxuan Gao; Yongjian Guo; Zhong Guan; Wen Huang; Wanlun Ma; Xi Xiao; Junwu Xiong; Sheng Wen

arXiv:2605.07288 · cs.CV · May 11, 2026

TL;DR

Sword introduces a robust World Model framework that enhances generalization and fidelity in vision-language-action tasks by disentangling visual textures and maintaining consistency through dynamic latent bootstrapping.

Contribution

The paper proposes Structure-Guided Style Augmentation and Dynamic Latent Bootstrapping to improve World Models' robustness and generalization in VLA policy training.

Findings

01 Outperforms the WoVR baseline on the LIBERO benchmark in generalization and robustness.

02 Significantly improves the generation quality and fidelity of simulated environments.

03 Increases the success rate of reinforcement-learning post-training for VLA models.

Abstract

The integration of Vision-Language-Action (VLA) models with World Models has gained increasing attention. One representative approach treats learned World Models as generative simulators, enabling policy optimization entirely within "imagination." However, when deployed as simulators for specific environments such as the LIBERO benchmark, existing World Models often suffer from poor generalization and long-horizon error accumulation. During closed-loop rollouts, these models are highly sensitive to initial-state perturbations; minor changes in color, illumination, and other visual factors can trigger cascading hallucinations, leading to severe blurriness or overexposure. Moreover, long-horizon error accumulation further degrades the quality and fidelity of predicted future states. These issues limit the reliability of World Models as simulators. To mitigate these problems, we propose…
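To make the long-horizon error accumulation described in the abstract concrete, the following is a toy sketch only. It is not Sword, WoVR, or any learned World Model: it uses a scalar linear system, and the functions `true_step`, `model_step`, and `rollout_errors`, along with all coefficients, are hypothetical choices made for illustration. It compares the one-step error of a slightly wrong model evaluated on ground-truth states with the error of the same model when it is rolled out closed-loop on its own predictions.

```python
# Toy illustration (assumed setup, not the paper's method): a scalar "world"
# and a slightly mis-specified "world model" rolled out in closed loop.

def true_step(x):
    """Hypothetical ground-truth dynamics."""
    return 0.9 * x + 1.0


def model_step(x):
    """Hypothetical learned model with a small coefficient error."""
    return 0.95 * x + 1.0


def rollout_errors(x0, horizon):
    """Return (one_step_errors, closed_loop_errors) over `horizon` steps.

    one_step_errors: model error when each prediction starts from the true state.
    closed_loop_errors: model error when the model consumes its own predictions.
    """
    x_true = x0
    x_closed = x0
    one_step, closed_loop = [], []
    for _ in range(horizon):
        # Error of a single prediction made from the true current state.
        one_step.append(abs(model_step(x_true) - true_step(x_true)))
        # Closed-loop rollout: the model's own output becomes its next input.
        x_closed = model_step(x_closed)
        x_true = true_step(x_true)
        closed_loop.append(abs(x_closed - x_true))
    return one_step, closed_loop


if __name__ == "__main__":
    one_step, closed_loop = rollout_errors(x0=0.0, horizon=50)
    for t in (0, 9, 49):
        print(f"step {t + 1}: one-step={one_step[t]:.3f} closed-loop={closed_loop[t]:.3f}")
```

In this toy system the one-step error stays bounded by a small constant, while the closed-loop error keeps growing with the horizon because each prediction inherits the error of the previous one. Real World Models operate on images and latents rather than scalars, so this only illustrates the compounding mechanism, not its magnitude.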
