Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL

Joey Hong; Anca Dragan; Sergey Levine

arXiv:2505.18098·cs.CL·December 4, 2025

TL;DR

This paper introduces goal-conditioned value functions that guide LLM reasoning and planning in interactive, multi-turn tasks. The approach avoids the memory and compute costs of multi-turn RL fine-tuning and scales to large API-based models.

Contribution

The authors propose a value function method that guides an LLM's reasoning without extensive RL fine-tuning. The method scales to large API-based models and is effective in complex interactive tasks.

Findings

01. Outperforms RL fine-tuning and prompting methods in interactive tasks

02. Scales efficiently to large API-based LLMs

03. Demonstrates superior reasoning in tool use, social deduction, and dialogue

Abstract

Large language models (LLMs) excel in tasks like question answering and dialogue, but complex tasks requiring interaction, such as negotiation and persuasion, require additional long-horizon reasoning and planning. Reinforcement learning (RL) fine-tuning can enable such planning in principle, but suffers from drawbacks that hinder scalability. In particular, multi-turn RL training incurs high memory and computational costs, which are exacerbated when training LLMs as policies. Furthermore, the largest LLMs do not expose the APIs necessary to be trained in such a manner. As a result, modern methods to improve the reasoning of LLMs rely on sophisticated prompting mechanisms rather than RL fine-tuning. To remedy this, we propose a novel approach that uses goal-conditioned value functions to guide the reasoning of LLM agents and that scales even to large API-based models. These value functions…
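The general idea of using a goal-conditioned value function to guide an agent can be illustrated with a small sketch. This is not the authors' implementation: the abstract is truncated here and does not describe how the value function is trained or queried. The names `pick_candidate` and `toy_value_fn` are hypothetical, and the word-overlap scorer is only a stand-in for a learned model. The sketch assumes that an LLM has already proposed several candidate next steps and that a separate function estimates how likely each one is to lead to a stated goal.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical signature: (state, candidate, goal) -> estimated chance of reaching the goal.
ValueFn = Callable[[str, str, str], float]


def pick_candidate(
    state: str,
    candidates: Sequence[str],
    goal: str,
    value_fn: ValueFn,
) -> Tuple[str, float]:
    """Score each candidate step with the value function and return the best one."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    scored = [(value_fn(state, c, goal), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def toy_value_fn(state: str, candidate: str, goal: str) -> float:
    """Stand-in scorer (an assumption): fraction of goal words found in the candidate.

    A learned goal-conditioned value function would replace this heuristic.
    """
    goal_words = set(goal.lower().split())
    cand_words = set(candidate.lower().split())
    if not goal_words:
        return 0.0
    return len(goal_words & cand_words) / len(goal_words)


if __name__ == "__main__":
    step, score = pick_candidate(
        state="buyer hesitates on price",
        candidates=["offer a discount", "ask about the weather"],
        goal="buyer accepts discount offer",
        value_fn=toy_value_fn,
    )
    print(step, score)
```

Because the scorer only reads text and returns a number, a selection loop of this kind needs no gradient access to the LLM that produced the candidates, which is consistent with the abstract's point about API-based models.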
