Any4D: Open-Prompt 4D Generation from Natural Language and Images

Hao Li; Qiao Sun

arXiv:2511.18746·cs.CV·March 30, 2026

TL;DR

This paper introduces PEWM, a framework that enhances embodied world models by focusing on primitive, short-horizon video generation, enabling better language-action alignment, data efficiency, and flexible control.

Contribution

PEWM restricts video generation to short horizons, improving alignment, efficiency, and inference speed, and incorporates a modular VLM planner with a Start-Goal heatmap for complex task generalization.
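To make the compositional idea concrete, here is a minimal Python sketch of how a planner might split a task into primitives and chain fixed-horizon clips. It is an illustration only, not the authors' implementation: `HORIZON`, `Primitive`, `plan`, `generate_clip`, and `rollout` are hypothetical names, the planner is a trivial string splitter standing in for a VLM, and frames are plain strings standing in for images.

```python
from dataclasses import dataclass
from typing import List, Tuple

HORIZON = 8  # hypothetical fixed clip length (frames) per primitive


@dataclass
class Primitive:
    instruction: str         # short language command for one primitive motion
    start: Tuple[int, int]   # hypothetical start location (stand-in for a start-goal heatmap)
    goal: Tuple[int, int]    # hypothetical goal location


def plan(task: str) -> List[Primitive]:
    """Stand-in for the VLM planner: split a task on ' then '."""
    steps = [s.strip() for s in task.split(" then ") if s.strip()]
    # Placeholder start/goal points; a real planner would predict these.
    return [Primitive(s, (0, 0), (1, 1)) for s in steps]


def generate_clip(prim: Primitive, first_frame: str) -> List[str]:
    """Stand-in for the short-horizon video model: always HORIZON frames.

    first_frame is the conditioning frame; this toy version ignores it.
    """
    return [f"{prim.instruction}#{t}" for t in range(HORIZON)]


def rollout(task: str, first_frame: str) -> List[str]:
    """Chain primitive clips, conditioning each on the last generated frame."""
    frames = [first_frame]
    for prim in plan(task):
        frames.extend(generate_clip(prim, frames[-1]))
    return frames


frames = rollout("reach cup then grasp cup then lift cup", "obs0")
```

The point of the sketch is the structure: each call to the generator covers only one primitive at a fixed horizon, so long tasks are handled by composition rather than by a single long video.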

Findings

01

Enables fine-grained language and action alignment.

02

Reduces learning complexity and data requirements.

03

Supports flexible, compositional control of primitive policies.

Abstract

While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation, hindering generative models from achieving a "GPT moment" in the embodied domain. We start from a simple observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose Primitive Embodied World Models (PEWM), which restrict video generation to fixed, shorter horizons. Our approach 1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions,…
