GROOT-2: Weakly Supervised Multi-Modal Instruction Following Agents

Shaofei Cai; Bowei Zhang; Zihao Wang; Haowei Lin; Xiaojian Ma; Anji Liu; Yitao Liang

arXiv:2412.10410·cs.AI·December 17, 2024

TL;DR

GROOT-2 introduces a semi-supervised, multimodal instruction-following agent that leverages weak supervision and latent variable models to learn from unlabeled demonstrations and align with human intentions, improving performance across diverse environments.

Contribution

The paper presents GROOT-2, a multimodal instruction-following agent trained with a semi-supervised approach that combines weak supervision with latent variable models.

Findings

01. Effective in four diverse environments
02. Learns from unlabeled demonstrations
03. Aligns latent space with human intentions

Abstract

Developing agents that can follow multimodal instructions remains a fundamental challenge in robotics and AI. Although large-scale pre-training on unlabeled datasets (no language instruction) has enabled agents to learn diverse behaviors, these agents often struggle with following instructions. While augmenting the dataset with instruction labels can mitigate this issue, acquiring such high-quality annotations at scale is impractical. To address this issue, we frame the problem as a semi-supervised learning task and introduce GROOT-2, a multimodal instructable agent trained using a novel approach that combines weak supervision with latent variable models. Our method consists of two key components: constrained self-imitating, which utilizes large amounts of unlabeled demonstrations to enable the policy to learn diverse behaviors, and human intention alignment, which uses a smaller set of…

Taxonomy

Topics: Speech and dialogue systems · Multi-Agent Systems and Negotiation · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training