STEVE-Audio: Expanding the Goal Conditioning Modalities of Embodied Agents in Minecraft
Nicholas Lenzen, Amogh Raut, Andrew Melnik

TL;DR
This paper extends the STEVE-1 framework for embodied agents in Minecraft by adding audio as a new goal conditioning modality, enabling agents to follow audio instructions with performance comparable to that of the text-conditioned and visual-conditioned agents.
Contribution
The paper introduces an approach for mapping audio inputs into the agent's latent goal space, expanding the control modalities available for embodied agents in Minecraft.
Findings
Audio-conditioned agents perform comparably to text-conditioned and visual-conditioned agents.
The authors developed an Audio-Video CLIP foundation model for Minecraft.
The authors open-sourced training and evaluation tools for multi-modal agents.
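The Audio-Video CLIP model listed above belongs to the contrastive pre-training family. As a rough illustration of that general technique, the sketch below computes a symmetric contrastive loss over a batch of paired audio and video embeddings. It is a minimal sketch under stated assumptions, not the authors' training code: the function names, the embedding sizes, and the temperature of 0.07 are all hypothetical choices made for this example.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """CLIP-style loss: pair i of audio and video should score highest.

    audio_emb and video_emb are equal-length lists of vectors; the
    temperature value is an assumption made for this sketch.
    """
    audio = [l2_normalize(v) for v in audio_emb]
    video = [l2_normalize(v) for v in video_emb]
    # logits[i][j] = scaled cosine similarity of audio i and video j
    logits = [
        [sum(a * b for a, b in zip(a_i, v_j)) / temperature for v_j in video]
        for a_i in audio
    ]
    logits_t = [list(col) for col in zip(*logits)]  # video-to-audio direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy batch: two perfectly matched pairs versus the same pairs shuffled.
audio_batch = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = symmetric_contrastive_loss(audio_batch, [[1.0, 0.0], [0.0, 1.0]])
loss_shuffled = symmetric_contrastive_loss(audio_batch, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the matched pairing gives a much lower loss than the shuffled one, which is the signal a contrastive model is trained on.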
Abstract
Recently, the STEVE-1 approach has been introduced as a method for training generative agents to follow instructions in the form of latent CLIP embeddings. In this work, we present a methodology to extend the control modalities by learning a mapping from new input modalities to the latent goal space of the agent. We apply our approach to the challenging Minecraft domain, and extend the goal conditioning to include the audio modality. The resulting audio-conditioned agent is able to perform on a comparable level to the original text-conditioned and visual-conditioned agents. Specifically, we create an Audio-Video CLIP foundation model for Minecraft and an audio prior network which together map audio samples to the latent goal space of the STEVE-1 policy. Additionally, we highlight the tradeoffs that occur when conditioning on different modalities. Our training code, evaluation code, and…
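The two-stage mapping described in the abstract, from an audio sample to the latent goal space, can be pictured as a simple function composition. The sketch below is only illustrative: `toy_audio_encoder` and `make_toy_prior` are hypothetical stand-ins for the Audio-Video CLIP audio encoder and the audio prior network, and the dimensions are arbitrary. Nothing here reflects the actual architectures, which the summary does not specify.

```python
import math
import random
from typing import Callable, List

Vector = List[float]

def make_audio_goal_fn(
    audio_encoder: Callable[[Vector], Vector],
    audio_prior: Callable[[Vector], Vector],
) -> Callable[[Vector], Vector]:
    """Compose encoder and prior into one audio -> goal-embedding function."""
    def audio_to_goal(waveform: Vector) -> Vector:
        audio_embedding = audio_encoder(waveform)  # audio sample -> audio embedding
        return audio_prior(audio_embedding)        # audio embedding -> latent goal
    return audio_to_goal

def toy_audio_encoder(waveform: Vector, dim: int = 4) -> Vector:
    """Hypothetical stand-in: mean absolute amplitude over `dim` equal chunks."""
    chunk = len(waveform) // dim
    return [
        sum(abs(x) for x in waveform[i * chunk:(i + 1) * chunk]) / chunk
        for i in range(dim)
    ]

def make_toy_prior(in_dim: int = 4, goal_dim: int = 6, seed: int = 0):
    """Hypothetical stand-in: a fixed random linear map into the goal space."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(goal_dim)]
    def prior(embedding: Vector) -> Vector:
        return [sum(w * e for w, e in zip(row, embedding)) for row in weights]
    return prior

audio_to_goal = make_audio_goal_fn(toy_audio_encoder, make_toy_prior())
waveform = [math.sin(0.3 * t) for t in range(16)]  # synthetic 16-sample signal
goal = audio_to_goal(waveform)  # a goal-conditioned policy would consume this vector
```

Keeping the encoder and the prior as separate callables mirrors the split described in the abstract: the policy only ever sees a vector in its existing goal space, so a new modality can be added without retraining it.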
Taxonomy
Topics: Multi-Agent Systems and Negotiation · AI-based Problem Solving and Planning · Semantic Web and Ontologies
Methods: Contrastive Language-Image Pre-training
