STEVE-Audio: Expanding the Goal Conditioning Modalities of Embodied Agents in Minecraft

Nicholas Lenzen; Amogh Raut; Andrew Melnik

arXiv:2412.00949 · cs.LG · December 3, 2024

TL;DR

This paper extends the STEVE-1 framework for embodied agents in Minecraft by incorporating audio as a new goal conditioning modality, enabling agents to follow audio instructions with performance comparable to text and visual modalities.

Contribution

The paper introduces a method for mapping audio inputs into the latent goal space of the STEVE-1 policy, adding audio to the text and visual control modalities already available to embodied agents in Minecraft.
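To make the idea of mapping a new modality into an existing goal space concrete, here is a minimal sketch. It assumes precomputed audio embeddings paired with goal embeddings, and it swaps in a ridge-regularized linear map for the paper's audio prior network, whose architecture is not described on this page. The function names `fit_linear_prior` and `audio_to_goal` are hypothetical.

```python
import numpy as np


def fit_linear_prior(audio_emb, goal_emb, ridge=1e-3):
    """Fit W minimizing ||audio_emb @ W - goal_emb||^2 + ridge * ||W||^2.

    Assumption: a linear map stands in for the learned audio prior network.
    audio_emb has shape (n, d_audio); goal_emb has shape (n, d_goal).
    """
    d_audio = audio_emb.shape[1]
    gram = audio_emb.T @ audio_emb + ridge * np.eye(d_audio)
    return np.linalg.solve(gram, audio_emb.T @ goal_emb)


def audio_to_goal(audio_emb, weights):
    """Project audio embeddings into the goal space and L2-normalize them."""
    goal = audio_emb @ weights
    return goal / np.linalg.norm(goal, axis=-1, keepdims=True)
```

In this sketch the policy itself is untouched: only the front end that produces the goal embedding changes, which is what allows a new modality to be added without retraining the agent.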

Findings

01

Audio-conditioned agents perform comparably to text-conditioned and visual-conditioned agents.

02

The authors developed an Audio-Video CLIP foundation model for Minecraft.

03

The authors open-sourced training and evaluation tools for multi-modal agents.

Abstract

Recently, the STEVE-1 approach has been introduced as a method for training generative agents to follow instructions in the form of latent CLIP embeddings. In this work, we present a methodology to extend the control modalities by learning a mapping from new input modalities to the latent goal space of the agent. We apply our approach to the challenging Minecraft domain, and extend the goal conditioning to include the audio modality. The resulting audio-conditioned agent is able to perform on a comparable level to the original text-conditioned and visual-conditioned agents. Specifically, we create an Audio-Video CLIP foundation model for Minecraft and an audio prior network which together map audio samples to the latent goal space of the STEVE-1 policy. Additionally, we highlight the tradeoffs that occur when conditioning on different modalities. Our training code, evaluation code, and…
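The abstract describes an Audio-Video CLIP model but does not spell out its training objective. As an illustration only, the sketch below shows the symmetric contrastive (InfoNCE) loss commonly used for CLIP-style models, applied to a batch of paired audio and video embeddings. The function name `symmetric_contrastive_loss` and the temperature value are assumptions, not details taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """CLIP-style loss: row i of audio_emb is the positive pair of row i of video_emb.

    Assumption: temperature=0.07 is a common default, not the paper's setting.
    """
    a = l2_normalize(audio_emb)
    v = l2_normalize(video_emb)
    logits = a @ v.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(a))
    # Audio-to-video: each audio clip should pick out its own video clip.
    loss_a2v = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Video-to-audio: each video clip should pick out its own audio clip.
    loss_v2a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a2v + loss_v2a)
```

Minimizing a loss of this form pulls matching audio and video embeddings together and pushes mismatched pairs apart, which is what lets an audio sample stand in for a visual goal.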

Taxonomy

Topics: Multi-Agent Systems and Negotiation · AI-based Problem Solving and Planning · Semantic Web and Ontologies

Methods: Contrastive Language-Image Pre-training