From Goal-Conditioned to Language-Conditioned Agents via Vision-Language Models

Theo Cachet; Christopher R. Dance; Olivier Sigaud

arXiv:2409.16024·cs.AI·November 27, 2024

TL;DR

This paper builds language-conditioned agents by decomposing the problem into two steps: selecting an environment configuration that a vision-language model scores highly for the task text, then reaching that configuration with a goal-conditioned policy. This decomposition enables zero-shot generalization.

Contribution

The paper introduces a method that combines vision-language models with pretrained goal-conditioned policies, improving zero-shot generalization to new tasks without training a separate policy for each one.

Findings

01 Outperforms multi-task RL baselines in zero-shot generalization

02 Uses distilled models and multi-view evaluation to enhance performance

03 Does not require textual task descriptions during training

Abstract

Vision-language models (VLMs) have tremendous potential for grounding language, and thus enabling language-conditioned agents (LCAs) to perform diverse tasks specified with text. This has motivated the study of LCAs based on reinforcement learning (RL) with rewards given by rendering images of an environment and evaluating those images with VLMs. If single-task RL is employed, such approaches are limited by the cost and time required to train a policy for each new task. Multi-task RL (MTRL) is a natural alternative, but requires a carefully designed corpus of training tasks and does not always generalize reliably to new tasks. Therefore, this paper introduces a novel decomposition of the problem of building an LCA: first find an environment configuration that has a high VLM score for text describing a task; then use a (pretrained) goal-conditioned policy to reach that configuration. We…
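The two-step decomposition described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: configurations are toy 2D positions, `toy_vlm_score` is a hand-written stand-in for rendering a configuration and scoring the image against task text with a VLM, and `goal_conditioned_step` is a hand-written stand-in for a pretrained goal-conditioned policy.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical toy environment configuration: a 2D position.
Config = Tuple[float, float]


def toy_vlm_score(config: Config) -> float:
    """Stand-in for a VLM score (assumption, not a real VLM).

    A real system would render the configuration to an image and score it
    against the task text. Here the score simply peaks at a fixed target.
    """
    target = (3.0, 4.0)
    return -((config[0] - target[0]) ** 2 + (config[1] - target[1]) ** 2)


def select_goal(candidates: Sequence[Config],
                score: Callable[[Config], float]) -> Config:
    """Step 1: pick the candidate configuration with the highest score."""
    return max(candidates, key=score)


def goal_conditioned_step(state: Config, goal: Config,
                          step_size: float = 0.5) -> Config:
    """Stand-in for a goal-conditioned policy: move straight toward the goal."""
    dx, dy = goal[0] - state[0], goal[1] - state[1]
    dist = (dx * dx + dy * dy) ** 0.5
    if dist <= step_size:
        return goal  # close enough to land exactly on the goal
    return (state[0] + step_size * dx / dist,
            state[1] + step_size * dy / dist)


def run_agent(start: Config, candidates: Sequence[Config],
              score: Callable[[Config], float],
              max_steps: int = 100) -> Tuple[Config, Config]:
    """Select a goal by score, then roll out the policy until it is reached."""
    goal = select_goal(candidates, score)
    state = start
    for _ in range(max_steps):
        if state == goal:
            break
        state = goal_conditioned_step(state, goal)
    return goal, state


if __name__ == "__main__":
    candidates = [(0.0, 0.0), (3.0, 4.0), (5.0, 5.0)]
    goal, final_state = run_agent((0.0, 0.0), candidates, toy_vlm_score)
    print("selected goal:", goal)
    print("final state:", final_state)
```

The point of the sketch is the separation of concerns: the scoring function only ranks configurations and never touches control, while the policy only needs a goal and never sees text. Swapping in a different scorer changes the task without retraining the policy, which is the property the abstract attributes to the decomposition.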

Taxonomy

Topics: Constraint Satisfaction and Optimization

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings