TL;DR
AstraNav-World introduces a unified probabilistic world model that jointly predicts future visuals and actions, enhancing embodied navigation in dynamic environments with improved accuracy and zero-shot real-world adaptation.
Contribution
The paper presents an end-to-end, diffusion-based framework that tightly couples visual prediction with action planning, aiming to make embodied navigation models more robust and more transferable.
Findings
Improved trajectory accuracy and success rates across benchmarks.
Tight vision-action coupling enhances prediction quality and policy reliability.
Zero-shot transfer to real-world settings, with no fine-tuning.
Abstract
Embodied navigation in open, dynamic environments demands accurate foresight of how the world will evolve and how actions will unfold over time. We propose AstraNav-World, an end-to-end world model that jointly reasons about future visual states and action sequences within a unified probabilistic framework. Our framework integrates a diffusion-based video generator with a vision-language policy, enabling synchronized rollouts where predicted scenes and planned actions are updated simultaneously. Training optimizes two complementary objectives: generating action-conditioned multi-step visual predictions and deriving trajectories conditioned on those predicted visuals. This bidirectional constraint makes visual predictions executable and keeps decisions grounded in physically consistent, task-relevant futures, mitigating cumulative errors common in decoupled "envision-then-plan"…
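The two complementary training objectives described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the function names `predict_frames`, `predict_actions`, and `joint_objective`, the weight matrices `W_v` and `W_a`, the loss weights, and the use of mean-squared error in place of the diffusion and policy losses are all hypothetical.

```python
import numpy as np

def predict_frames(obs, actions, W_v):
    """Hypothetical stand-in for an action-conditioned video generator.

    obs: (D,) current observation features
    actions: (T, A) action sequence
    W_v: (D, D + A) toy transition weights
    Returns predicted future features of shape (T, D).
    """
    state = obs
    frames = []
    for a in actions:
        # Each future state depends on the previous state and the action taken.
        state = np.tanh(W_v @ np.concatenate([state, a]))
        frames.append(state)
    return np.stack(frames)

def predict_actions(frames, W_a):
    """Hypothetical stand-in for a policy head: (T, D) frames -> (T, A) actions."""
    return frames @ W_a.T

def joint_objective(obs, true_frames, true_actions, W_v, W_a,
                    w_visual=1.0, w_action=1.0):
    # Objective 1: multi-step visual prediction conditioned on actions.
    pred_frames = predict_frames(obs, true_actions, W_v)
    visual_loss = float(np.mean((pred_frames - true_frames) ** 2))

    # Objective 2: trajectory derived from the predicted visuals, so an
    # error in the imagined future also shows up in the action loss.
    pred_actions = predict_actions(pred_frames, W_a)
    action_loss = float(np.mean((pred_actions - true_actions) ** 2))

    total = w_visual * visual_loss + w_action * action_loss
    return total, visual_loss, action_loss
```

Because the action term is computed from the predicted frames rather than from ground-truth frames, the two terms constrain each other in this sketch, which is the intuition behind the bidirectional constraint the abstract mentions. How the actual model couples the two losses is not specified in the text above.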
