RLZero: Direct Policy Inference from Language Without In-Domain Supervision

Harshit Sikchi; Siddhant Agarwal; Pranaya Jajoo; Samyak Parajuli; Caleb Chuck; Max Rudolph; Peter Stone; Amy Zhang; Scott Niekum

arXiv:2412.05718·cs.AI·November 26, 2025

TL;DR

RLZero enables zero-shot policy inference from natural language instructions by imagining, projecting, and imitating observations without in-domain supervision, leveraging pretrained RL agents and generative models.

Contribution

The paper introduces RLZero, a framework that infers policies directly from language without task-specific supervision or labeled data, using a three-step process: imagine, project, and imitate.

Findings

1. First zero-shot language-to-behavior generation across multiple tasks.

2. Effective policy imitation from imagined observations.

3. Ability to generate policies from cross-embodied videos, such as those found on YouTube.

Abstract

The reward hypothesis states that all goals and purposes can be understood as the maximization of a received scalar reward signal. However, in practice, defining such a reward signal is notoriously difficult, as humans are often unable to predict the optimal behavior corresponding to a reward function. Natural language offers an intuitive alternative for instructing reinforcement learning (RL) agents, yet previous language-conditioned approaches either require costly supervision or test-time training given a language instruction. In this work, we present a new approach that uses a pretrained RL agent trained using only unlabeled, offline interactions--without task-specific supervision or labeled trajectories--to get zero-shot test-time policy inference from arbitrary natural language instructions. We introduce a framework comprising three steps: imagine, project, and imitate. First, the…
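
The three named steps (imagine, project, imitate) can be read as a simple pipeline. The sketch below is illustrative only and is not the authors' implementation: the function names, the vector representation of observations, and the use of cosine-similarity nearest-neighbor lookup for the projection step are all assumptions made for this example. The generative model and the pretrained RL agent are passed in as plain callables, and the toy stand-ins at the bottom are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Policy = Callable[[Vector], Vector]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def imagine(instruction: str, generator: Callable[[str], List[Vector]]) -> List[Vector]:
    """Step 1: turn a language instruction into a sequence of imagined observations."""
    return generator(instruction)


def project(imagined: List[Vector], real_observations: List[Vector]) -> List[Vector]:
    """Step 2: replace each imagined frame with the most similar real observation.

    Assumption: nearest neighbor under cosine similarity stands in for
    whatever projection the paper actually uses.
    """
    return [max(real_observations, key=lambda obs: cosine(frame, obs))
            for frame in imagined]


def imitate(projected: List[Vector],
            agent: Callable[[List[Vector]], Policy]) -> Policy:
    """Step 3: ask a pretrained agent for a policy that matches the observations."""
    return agent(projected)


def language_to_policy(instruction: str,
                       generator: Callable[[str], List[Vector]],
                       real_observations: List[Vector],
                       agent: Callable[[List[Vector]], Policy]) -> Policy:
    """Chain the three steps; no training happens at test time in this sketch."""
    imagined = imagine(instruction, generator)
    projected = project(imagined, real_observations)
    return imitate(projected, agent)


# --- Hypothetical toy stand-ins, for illustration only ---
def toy_generator(instruction: str) -> List[Vector]:
    # Pretend generative model: always "imagines" the same two frames.
    return [[0.9, 0.1], [0.2, 0.8]]


def toy_agent(target_observations: List[Vector]) -> Policy:
    # Pretend pretrained agent: steer toward the last target observation.
    goal = target_observations[-1]

    def policy(state: Vector) -> Vector:
        return [g - s for g, s in zip(goal, state)]

    return policy


dataset = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # unlabeled offline observations
policy = language_to_policy("walk forward", toy_generator, dataset, toy_agent)
action = policy([0.0, 0.0])
```

Keeping the generator and the agent as injected callables reflects the claim that both components are pretrained and reused as-is; only the projection step touches the offline data in this sketch.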

Videos

RLZero: Direct Policy Inference from Language Without In-Domain Supervision (SlidesLive)
