ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting
Shaofei Cai, Zihao Wang, Kewei Lian, Zhancun Mu, Xiaojian Ma, Anji Liu, Yitao Liang

TL;DR
ROCKET-1 introduces a visual-temporal context prompting method that enhances vision-language models with spatial reasoning capabilities, enabling complex open-world interactions in Minecraft with significant performance improvements.
Contribution
The paper presents a novel communication protocol and training approach that allows VLMs to incorporate spatial and temporal context for improved decision-making in embodied environments.
Findings
Achieved a 76% improvement in open-world interaction performance in Minecraft.
Enabled VLM-based agents to perform complex spatial reasoning tasks.
Demonstrated the effectiveness of visual-temporal context prompting in real-time environments.
Abstract
Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges. One critical issue is bridging the gap between discrete entities in low-level observations and the abstract concepts required for effective planning. A common solution is building hierarchical agents, where VLMs serve as high-level reasoners that break down tasks into executable sub-tasks, typically specified using language. However, language suffers from the inability to communicate detailed spatial information. We propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. This protocol leverages object segmentation from past observations to guide policy-environment interactions. Using this approach, we train ROCKET-1, a low-level policy that predicts actions based on concatenated…
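To make the idea of prompting a policy with segmentation concrete, here is a minimal sketch of how a window of past frames could be paired with object masks before being passed to a low-level policy. The abstract above is cut off before it says exactly what is concatenated, so the channel-wise stacking of RGB frames with binary masks is an assumption, and the name `build_policy_input` is hypothetical rather than taken from the ROCKET-1 code.

```python
import numpy as np


def build_policy_input(frames, masks):
    """Stack each RGB frame with its object mask along the channel axis.

    Assumed layout (not confirmed by the source text):
      frames: (T, H, W, 3) uint8 RGB observations
      masks:  (T, H, W) boolean masks marking the object to interact with
    Returns a float32 array of shape (T, H, W, 4).
    """
    frames = np.asarray(frames, dtype=np.float32) / 255.0  # scale RGB to [0, 1]
    masks = np.asarray(masks, dtype=np.float32)[..., None]  # add a channel axis
    if frames.shape[:3] != masks.shape[:3]:
        raise ValueError("frames and masks must share (T, H, W)")
    return np.concatenate([frames, masks], axis=-1)


# Two 4x4 white frames; the target object is highlighted only in the first one.
frames = np.full((2, 4, 4, 3), 255, dtype=np.uint8)
masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, 1, 1] = True
policy_input = build_policy_input(frames, masks)
```

In this sketch the mask channel carries the spatial detail that a language instruction cannot, which is the gap the abstract identifies. How the real policy encodes time and interaction type is not covered here.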
Taxonomy
Topics: Data Visualization and Analytics
