GROOT: Learning to Follow Instructions by Watching Gameplay Videos
Shaofei Cai, Bowei Zhang, Zihao Wang, Xiaojian Ma, Anji Liu, Yitao Liang

TL;DR
GROOT is a controller that learns to follow open-ended instructions in open-world environments by watching gameplay videos, narrowing the gap between human and machine performance on complex tasks.
Contribution
It introduces a new learning framework and architecture for instruction following from videos, creating a structured goal space without needing text annotations.
Findings
Elo ratings show that GROOT is closing the human-machine performance gap.
It achieves a 70% winning rate over the best generalist agent baseline.
It induces a goal space with emergent properties such as goal composition.
Abstract
We study the problem of building a controller that can follow open-ended instructions in open-world environments. We propose to follow reference videos as instructions, which offer expressive goal specifications while eliminating the need for expensive text-gameplay annotations. A new learning framework is derived to allow learning such instruction-following controllers from gameplay videos while producing a video instruction encoder that induces a structured goal space. We implement our agent GROOT in a simple yet effective encoder-decoder architecture based on causal transformers. We evaluate GROOT against open-world counterparts and human players on a proposed Minecraft SkillForge benchmark. The Elo ratings clearly show that GROOT is closing the human-machine gap as well as exhibiting a 70% winning rate over the best generalist agent baseline. Qualitative analysis of the induced goal…
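To give a feel for what a 70% winning rate means on an Elo scale, here is a minimal sketch using the standard logistic Elo model (base 10, scale 400). The function names are illustrative, ties are ignored, and this summary does not say whether the paper's Elo computation uses the same convention, so treat the numbers as a rough guide only.

```python
import math


def elo_expected_score(rating_gap: float) -> float:
    """Expected score of a player rated `rating_gap` points above the opponent.

    Uses the standard logistic Elo formula; the 400-point scale is an
    assumption, not something stated in the summary above.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


def elo_gap_for_win_rate(win_rate: float) -> float:
    """Invert the formula: the rating gap that yields a given expected score."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


if __name__ == "__main__":
    gap = elo_gap_for_win_rate(0.70)
    # Under these assumptions a 70% win rate is a gap of about 147 points.
    print(f"rating gap for a 70% win rate: {gap:.1f}")
    print(f"round trip: {elo_expected_score(gap):.2f}")
```

Under this model, equal ratings give an expected score of 0.5, and the gap grows without bound as the win rate approaches 1.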
Peer Reviews
Decision: ICLR 2024 spotlight
- The overall paper is well written and the method is clearly presented. The proposed method is simple, and providing the ability to condition more flexibly on multi-step video-based goals is a meaningful contribution in the space of developing generalist agents, as there is an abundance of video-based data available online to learn from.
- The experiments seem thorough, with clear improvements compared to prior methods on a suite of different complex tasks within Minecraft. The paper considers
- While the multi-step video representation for goals is more expressive than other prior Minecraft agents (that use language or outcome videos), having to provide a video to condition on can be difficult to do in practice if we do not already have access to similarly representative videos, and as the authors note, training the video-based goal space is challenging. On the other hand, it is much easier to describe a desired goal in language. It would be interesting to see if this method can be a
Significance: This paper addresses an interesting problem, that is, using gameplay videos to train instruction-following controllers, and validates the effectiveness of videos compared to visual information and text. Originality: The main innovation of this paper lies in proposing a method for learning the representation of video instructions (i.e., goal space) and providing relevant theoretical derivations. Through ablation experiments, the paper demonstrates that imposing constraints on the g
1. The modeling method for video instruction in this paper is innovative and can improve performance. However, since the low-level controller still needs to select task-related videos as instructions when performing specific tasks (Section 5.1), it cannot be well adapted to LLM-based planning agents. Although the video instructions used are generated from other biomes, it only proves the controller's generalization ability in specific tasks, rather than the generalization ability for video instr
The newly designed benchmark is a nice addition that helps evaluate the proposed agent. It covers a wide range of different activities in the Minecraft environment, including some long-horizon tasks, building tools. Releasing the evaluation benchmark will help the community. The design of the model architecture intuitively makes sense.
There are not enough training details disclosed in the paper. The ablation on the KL loss is nice. More ablation studies on, for example, the number of learnable tokens would be appreciated. These experiments would further validate the robustness of the model for the task. The training of the model still requires action input. This means that for raw video, GROOT relies on an inverse dynamics model to generate pseudo action labels. The idea of an agent learning from video might have oversold the novelty
Taxonomy
Topics: Reinforcement Learning in Robotics · Artificial Intelligence in Games
