ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting

Shaofei Cai; Zihao Wang; Kewei Lian; Zhancun Mu; Xiaojian Ma; Anji Liu; Yitao Liang

arXiv:2410.17856 · cs.CV · March 21, 2025

TL;DR

ROCKET-1 introduces a visual-temporal context prompting method that enhances vision-language models with spatial reasoning capabilities, enabling complex open-world interactions in Minecraft with significant performance improvements.

Contribution

The paper presents a novel communication protocol and training approach that allows VLMs to incorporate spatial and temporal context for improved decision-making in embodied environments.

Findings

1. Achieved a 76% improvement in open-world interaction performance in Minecraft.

2. Enabled VLM-based agents to perform complex spatial reasoning tasks.

3. Demonstrated the effectiveness of visual-temporal context prompting in real-time environments.

Abstract

Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges. One critical issue is bridging the gap between discrete entities in low-level observations and the abstract concepts required for effective planning. A common solution is building hierarchical agents, where VLMs serve as high-level reasoners that break down tasks into executable sub-tasks, typically specified using language. However, language suffers from the inability to communicate detailed spatial information. We propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. This protocol leverages object segmentation from past observations to guide policy-environment interactions. Using this approach, we train ROCKET-1, a low-level policy that predicts actions based on concatenated…
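
The division of labor described in the abstract can be illustrated with a toy sketch: a high-level reasoner marks the object of interest with segmentation masks over past observations, and a low-level policy acts on those frames and masks instead of on a language instruction. Everything named below (ContextPrompt, high_level_reasoner, low_level_policy, the action strings, the pixel-matching "segmenter", and the centroid rule) is a hypothetical stand-in chosen for illustration and is not taken from the ROCKET-1 code base. How the actual policy combines its inputs is not shown here, since the abstract is truncated at that point.

```python
from dataclasses import dataclass
from typing import List

Frame = List[List[int]]  # toy grayscale image, rows of pixel values
Mask = List[List[int]]   # binary mask with the same shape as a frame


@dataclass
class ContextPrompt:
    """Hypothetical container: past frames paired with object masks."""
    frames: List[Frame]
    masks: List[Mask]


def high_level_reasoner(frames: List[Frame], target_value: int) -> ContextPrompt:
    """Stand-in for the VLM plus a segmenter (assumption): a pixel is part
    of the target object if it equals target_value."""
    masks = [
        [[1 if px == target_value else 0 for px in row] for row in frame]
        for frame in frames
    ]
    return ContextPrompt(frames=list(frames), masks=masks)


def low_level_policy(prompt: ContextPrompt) -> str:
    """Toy policy (assumption): steer toward the horizontal centroid of the
    masked object in the most recent frame."""
    mask = prompt.masks[-1]
    cols = [x for row in mask for x, v in enumerate(row) if v]
    if not cols:
        return "noop"  # nothing is highlighted, so there is nothing to pursue
    centroid = sum(cols) / len(cols)
    center = (len(mask[0]) - 1) / 2
    if centroid < center:
        return "turn_left"
    if centroid > center:
        return "turn_right"
    return "forward"


# The object (pixel value 7) sits in the right-hand column of the frame.
frames = [[[0, 0, 7], [0, 0, 7]]]
prompt = high_level_reasoner(frames, target_value=7)
print(low_level_policy(prompt))  # turn_right
```

The point of the sketch is that the mask carries spatial information ("the object on the right") that a short language instruction would not convey to the policy.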

Code & Models

Repositories

CraftJarvis/ROCKET-1 (PyTorch, official)

Models

phython96/ROCKET-1 (Hugging Face model, 18 downloads, 5 likes)

Taxonomy

Topics: Data Visualization and Analytics