Any4D: Open-Prompt 4D Generation from Natural Language and Images
Hao Li, Qiao Sun

TL;DR
This paper introduces PEWM, a framework that enhances embodied world models by focusing on primitive, short-horizon video generation, enabling better language-action alignment, data efficiency, and flexible control.
Contribution
PEWM restricts video generation to short horizons, improving alignment, efficiency, and inference speed, and incorporates a modular VLM planner with a Start-Goal heatmap for complex task generalization.
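As a rough illustration of the short-horizon idea described above, the sketch below chains fixed-length primitive clips into a longer rollout. It is a toy under stated assumptions, not the authors' implementation: `Primitive`, `plan`, `generate_clip`, and `rollout` are hypothetical names, the "planner" is a string split standing in for the VLM planner, the "generator" is simple arithmetic standing in for the video model, and the Start-Goal heatmap is not modeled at all.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stand-in for an image frame; a real system would use tensors.
Frame = List[float]


@dataclass
class Primitive:
    instruction: str  # short language command, e.g. "grasp cup"
    horizon: int      # fixed, short number of frames to generate


def plan(task: str, horizon: int = 8) -> List[Primitive]:
    """Toy planner (assumption): split the task on ' then '.

    In the paper's framing a VLM planner would do this decomposition.
    """
    steps = [s.strip() for s in task.split(" then ") if s.strip()]
    return [Primitive(instruction=s, horizon=horizon) for s in steps]


def generate_clip(start: Frame, prim: Primitive) -> List[Frame]:
    """Toy generator (assumption): emit `prim.horizon` frames from `start`.

    A real model would condition on `prim.instruction`; here each frame
    just increments the previous values so the rollout is easy to inspect.
    """
    return [[v + (t + 1) for v in start] for t in range(prim.horizon)]


def rollout(task: str, first_frame: Frame) -> List[Frame]:
    """Compose primitives: each clip starts from the last generated frame."""
    frames = [first_frame]
    for prim in plan(task):
        frames.extend(generate_clip(frames[-1], prim))
    return frames


frames = rollout("reach cup then grasp cup then lift cup", [0.0])
print(len(frames))  # 1 initial frame + 3 primitives * 8 frames each
```

The point of the sketch is only the control flow: long tasks are never generated in one pass; they are assembled from short clips, each conditioned on a single primitive instruction and the previous clip's final frame.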
Findings
Enables fine-grained language and action alignment.
Reduces learning complexity and data requirements.
Supports flexible, compositional control of primitive policies.
Abstract
While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation, hindering generative models from achieving a "GPT moment" in the embodied domain. There is a naive observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose Primitive Embodied World Models (PEWM), which restricts video generation to fixed, shorter horizons. Our approach 1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions,…
