Astra: General Interactive World Model with Autoregressive Denoising
Yixuan Zhu, Jiaqi Feng, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Jie Zhou, Jiwen Lu

TL;DR
Astra is a versatile world model capable of generating long-term, interactive video predictions across diverse real-world scenarios by integrating autoregressive denoising, action-aware control, and dynamic routing of action modalities.
Contribution
The paper introduces Astra, a novel interactive world model with an autoregressive denoising architecture, action-aware adaptation, and dynamic action routing for diverse real-world applications.
Findings
Outperforms existing models in fidelity and long-range prediction.
Supports various interactions like exploration, manipulation, and camera control.
Demonstrates effectiveness across multiple datasets.
Abstract
Recent advances in diffusion transformers have empowered video generation models to generate high-quality video clips from texts or images. However, world models with the ability to predict long-horizon futures from past observations and actions remain underexplored, especially for general-purpose scenarios and various forms of actions. To bridge this gap, we introduce Astra, an interactive general world model that generates real-world futures for diverse scenarios (e.g., autonomous driving, robot grasping) with precise action interactions (e.g., camera motion, robot action). We propose an autoregressive denoising architecture and use temporal causal attention to aggregate past observations and support streaming outputs. We use a noise-augmented history memory to avoid over-reliance on past frames to balance responsiveness with temporal coherence. For precise action control, we…
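The two mechanisms named in the abstract, temporal causal attention and a noise-augmented history memory, can be illustrated with a toy sketch. The excerpt above does not give the mask granularity, the noise schedule, or any function names, so everything here is an assumption for illustration: `temporal_causal_mask` builds a frame-level causal mask (tokens attend within their own frame and to earlier frames), and `noise_augment_history` perturbs past-frame latents with Gaussian noise of a chosen standard deviation.

```python
import random


def temporal_causal_mask(num_frames, tokens_per_frame):
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumption: causality is applied per frame, so a token sees every token
    in its own frame and in all earlier frames, but none in later frames.
    """
    n = num_frames * tokens_per_frame
    return [
        [(k // tokens_per_frame) <= (q // tokens_per_frame) for k in range(n)]
        for q in range(n)
    ]


def noise_augment_history(history, sigma, rng):
    """Add zero-mean Gaussian noise with std `sigma` to every past-frame latent.

    `history` is a list of frames, each a list of floats (hypothetical latents).
    """
    return [[x + sigma * rng.gauss(0.0, 1.0) for x in frame] for frame in history]


if __name__ == "__main__":
    for row in temporal_causal_mask(num_frames=2, tokens_per_frame=2):
        print(["x" if allowed else "." for allowed in row])
    print(noise_augment_history([[1.0, 2.0], [3.0, 4.0]], 0.1, random.Random(0)))
```

In this toy setting, a larger `sigma` makes the history less informative, which is one way to read the abstract's stated goal of avoiding over-reliance on past frames.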
Peer Reviews
Decision·ICLR 2026 Poster
**Strength (1)**: The paper is well-organized. Authors explain the core ideas with clear diagrams and concrete algorithmic descriptions. **Strength (2)**: Astra is a single model across multi-modal action spaces, covering camera poses, keyboard/mouse inputs, and robot poses. **Strength (3)**: The proposed solutions exhibit several elegant and practical design choices: - The action-free guidance mechanism offers a simple, original mechanism to amplify action effects without heavy architectura
**Weakness (1)**: The definition and formulation of action signals are insufficiently specified. The paper does not clearly describe how different types of actions (e.g., camera poses, keyboard/mouse inputs, robot poses) are represented, parsed, and projected to the action encoder. **Weakness (2)**: A comparable method, YUME [1], is not discussed in Section 2 (Related Work). The paper does not clearly articulate how Astra differs from or improves upon YUME, which weakens the presentation. **
1. This paper uses a lightweight action-aware adapter for precise action conditioning. 2. Astra achieves good responsiveness and is able to generate long, temporally coherent video sequences by employing a noise-as-mask strategy during training. 3. Astra employs a mixture of action experts to effectively adapt to diverse scenarios and handle various types of action inputs.
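The mixture of action experts mentioned in point 3 can be pictured as a router that scores each expert for an incoming action and blends the top-scoring experts' embeddings. The sketch below is a generic mixture-of-experts routing example, not the paper's implementation: the linear router, the top-k selection, and the names `route_action`, `router_weights`, and `experts` are all assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_action(action, router_weights, experts, top_k=1):
    """Blend the embeddings of the top_k experts chosen by a linear router.

    action:         list of floats (hypothetical action features)
    router_weights: one weight row per expert, same length as `action`
    experts:        callables mapping an action to an embedding (list of floats)
    """
    # Router logits: one dot product per expert.
    logits = [sum(w * a for w, a in zip(row, action)) for row in router_weights]
    probs = softmax(logits)
    # Keep the top_k most probable experts and renormalize their weights.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    outputs = {i: experts[i](action) for i in chosen}
    dim = len(outputs[chosen[0]])
    blended = [0.0] * dim
    for i in chosen:
        weight = probs[i] / norm
        for d in range(dim):
            blended[d] += weight * outputs[i][d]
    return blended


if __name__ == "__main__":
    camera_expert = lambda a: [1.0, 0.0]  # stand-in for a camera-pose expert
    robot_expert = lambda a: [0.0, 1.0]   # stand-in for a robot-pose expert
    router = [[10.0, 0.0], [0.0, 10.0]]
    print(route_action([1.0, 0.0], router, [camera_expert, robot_expert], top_k=2))
```

With `top_k=1` the router acts as a hard switch between experts; larger values give a soft blend.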
1. The mixture-of-action-experts idea is similar to [1, 2], and the action-aware adapter is similar to [3, 4]. Please provide a conceptual comparison with these references. 2. The paper does not thoroughly analyze the underlying reasons why the noise-as-mask strategy enables the generation of long, temporally coherent video sequences. 3. The paper does not explain why the router network performs so well across diverse scenarios and with various types of action inputs. [1] Mixture of Action Expert Embeddi
1. The paper addresses the limitation of passive video generation by showing interactive world modeling where video synthesis is conditioned on external actions. 2. The framework proposes a single, general-purpose model by training on diverse datasets of driving, robotics, and exploration, and handles heterogeneous action types via a Mixture of Action Experts. 3. The paper proposes a noisy memory training strategy which forces the model to rely on action signals and not over-rely on past visual i
1. The ACT-Adapter seems to show only a minimal performance improvement in Table 2. The ablation study in Table 2 shows it achieves a score of 0.669 on Instruction Following, while a cross-attention adapter achieves 0.642, suggesting the performance gain of the new adapter is relatively small. 2. The comparison to baseline methods in Table 1 is not a fair comparison. a) Since Wan2.1 is the pre-trained backbone of Astra, it would be an ablation of the paper rather than a baseline. b) Matrix
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Human Motion and Animation
