GROOT-2: Weakly Supervised Multi-Modal Instruction Following Agents
Shaofei Cai, Bowei Zhang, Zihao Wang, Haowei Lin, Xiaojian Ma, Anji Liu, Yitao Liang

TL;DR
GROOT-2 introduces a semi-supervised, multimodal instruction-following agent that leverages weak supervision and latent variable models to learn from unlabeled demonstrations and align with human intentions, improving performance across diverse environments.
Contribution
The paper presents GROOT-2, a novel semi-supervised approach combining weak supervision with latent models for multimodal instruction following in robotics and AI.
Findings
Effective in four diverse environments
Learns from unlabeled demonstrations
Aligns latent space with human intentions
Abstract
Developing agents that can follow multimodal instructions remains a fundamental challenge in robotics and AI. Although large-scale pre-training on unlabeled datasets (no language instruction) has enabled agents to learn diverse behaviors, these agents often struggle with following instructions. While augmenting the dataset with instruction labels can mitigate this issue, acquiring such high-quality annotations at scale is impractical. To address this issue, we frame the problem as a semi-supervised learning task and introduce GROOT-2, a multimodal instructable agent trained using a novel approach that combines weak supervision with latent variable models. Our method consists of two key components: constrained self-imitating, which utilizes large amounts of unlabeled demonstrations to enable the policy to learn diverse behaviors, and human intention alignment, which uses a smaller set of…
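The two components named in the abstract can be pictured as two terms of one training objective. What follows is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the KL penalty standing in for the "constraint", the latent-matching term, and the weights beta and lam are all hypothetical choices made for illustration. Latents and predicted actions are passed in as plain vectors, so no encoder or policy network is modeled.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def unit_gaussian_kl(mu: Vector) -> float:
    """KL(N(mu, I) || N(0, I)), which reduces to 0.5 * ||mu||^2."""
    return 0.5 * sum(m * m for m in mu)


def self_imitation_loss(pred_actions: Vector, demo_actions: Vector,
                        demo_latent: Vector, beta: float = 0.1) -> float:
    # Unlabeled branch (assumption): the policy reproduces a demonstration's
    # actions from a latent encoded from that same demonstration. The KL
    # term is a hypothetical stand-in for the "constraint" on the latent.
    return mse(pred_actions, demo_actions) + beta * unit_gaussian_kl(demo_latent)


def alignment_loss(pred_actions: Vector, demo_actions: Vector,
                   instr_latent: Vector, demo_latent: Vector) -> float:
    # Labeled branch (assumption): actions predicted from the instruction's
    # latent should match the demonstration, and the instruction latent
    # should sit close to the demonstration latent.
    return mse(pred_actions, demo_actions) + mse(instr_latent, demo_latent)


def semi_supervised_loss(unlabeled: List[Dict[str, Vector]],
                         labeled: List[Dict[str, Vector]],
                         lam: float = 1.0) -> float:
    """Average the unlabeled and labeled terms and combine them with weight lam."""
    u = sum(self_imitation_loss(**ex) for ex in unlabeled) / len(unlabeled)
    s = sum(alignment_loss(**ex) for ex in labeled) / len(labeled)
    return u + lam * s


if __name__ == "__main__":
    # Toy batches with made-up numbers, for illustration only.
    unlabeled = [{"pred_actions": [1.0, 0.0], "demo_actions": [1.0, 0.0],
                  "demo_latent": [0.0, 0.0]}]
    labeled = [{"pred_actions": [0.0], "demo_actions": [1.0],
                "instr_latent": [1.0], "demo_latent": [1.0]}]
    print(semi_supervised_loss(unlabeled, labeled))
```

In this sketch the large unlabeled batch shapes the latent space through reconstruction, while the smaller labeled batch ties instruction encodings to that same space; how GROOT-2 actually balances the two terms is not stated in the text above.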
Taxonomy
Topics: Speech and dialogue systems · Multi-Agent Systems and Negotiation · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
