From Goal-Conditioned to Language-Conditioned Agents via Vision-Language Models
Theo Cachet, Christopher R. Dance, Olivier Sigaud

TL;DR
The paper builds language-conditioned agents by splitting the problem in two: first select an environment configuration that a vision-language model scores highly for the task text, then reach that configuration with a goal-conditioned policy. This decomposition enables zero-shot generalization to new tasks.
Contribution
The paper introduces a method that combines vision-language models with a pretrained goal-conditioned policy, improving zero-shot generalization without training a policy for each new task.
Findings
Outperforms multi-task RL baselines in zero-shot generalization
Uses distilled models and multi-view evaluation to enhance performance
Does not require textual task descriptions during training
Abstract
Vision-language models (VLMs) have tremendous potential for grounding language, and thus enabling language-conditioned agents (LCAs) to perform diverse tasks specified with text. This has motivated the study of LCAs based on reinforcement learning (RL) with rewards given by rendering images of an environment and evaluating those images with VLMs. If single-task RL is employed, such approaches are limited by the cost and time required to train a policy for each new task. Multi-task RL (MTRL) is a natural alternative, but requires a carefully designed corpus of training tasks and does not always generalize reliably to new tasks. Therefore, this paper introduces a novel decomposition of the problem of building an LCA: first find an environment configuration that has a high VLM score for text describing a task; then use a (pretrained) goal-conditioned policy to reach that configuration. We…
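The two-step decomposition described in the abstract can be illustrated with a minimal Python sketch. Everything below is a hypothetical stand-in, not the authors' implementation: `toy_score` takes the place of "render the configuration and score the image against the task text with a VLM", `toy_policy` takes the place of the pretrained goal-conditioned policy, and the candidate configurations are given by hand, since the excerpt does not say how the paper searches for them.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical 2-D environment configuration, e.g. an object position.
Config = Tuple[float, float]


def select_goal(candidates: Sequence[Config],
                score: Callable[[Config], float]) -> Config:
    """Step 1: pick the candidate configuration with the highest score."""
    return max(candidates, key=score)


def toy_score(config: Config) -> float:
    """Stand-in for a VLM score of a rendered configuration (assumption).

    Here the 'task' is simply to be close to the point (1, 2).
    """
    target = (1.0, 2.0)
    return -((config[0] - target[0]) ** 2 + (config[1] - target[1]) ** 2)


def toy_policy(state: Config, goal: Config, step: float = 0.5) -> Config:
    """Stand-in for a goal-conditioned policy (assumption).

    Moves each coordinate toward the goal by at most `step`.
    """
    def clip(delta: float) -> float:
        return max(-step, min(step, delta))

    return (state[0] + clip(goal[0] - state[0]),
            state[1] + clip(goal[1] - state[1]))


def run_agent(start: Config,
              candidates: Sequence[Config],
              score: Callable[[Config], float],
              policy: Callable[[Config, Config], Config],
              max_steps: int = 20,
              tol: float = 1e-9) -> Tuple[Config, Config]:
    """Select a goal configuration, then roll out the policy toward it."""
    goal = select_goal(candidates, score)
    state = start
    for _ in range(max_steps):
        reached = (abs(state[0] - goal[0]) <= tol
                   and abs(state[1] - goal[1]) <= tol)
        if reached:
            break
        state = policy(state, goal)  # Step 2: goal-conditioned execution
    return goal, state


if __name__ == "__main__":
    candidates = [(0.0, 0.0), (1.0, 2.0), (3.0, 3.0)]
    goal, final = run_agent((0.0, 0.0), candidates, toy_score, toy_policy)
    print("selected goal:", goal)
    print("final state:", final)
```

In this sketch the language input enters only through the scoring function, so the policy itself never sees text. That is consistent with the finding listed above that textual task descriptions are not required during training.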
Taxonomy
Topics: Constraint Satisfaction and Optimization
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
