Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL
Joey Hong, Anca Dragan, Sergey Levine

TL;DR
This paper introduces goal-conditioned value functions that guide the reasoning and planning of LLM agents in interactive, multi-turn tasks, avoiding the memory and compute costs of multi-turn RL fine-tuning and scaling to large API-based models.
Contribution
The authors propose a value function method that guides an LLM's reasoning without extensive RL fine-tuning. The method scales to large API-based models and is effective in complex interactive tasks.
Findings
Outperforms RL fine-tuning and prompting methods in interactive tasks
Scales efficiently to large API-based LLMs
Demonstrates superior reasoning in tool use, social deduction, and dialogue
Abstract
Large language models (LLMs) excel in tasks like question answering and dialogue, but complex tasks requiring interaction, such as negotiation and persuasion, require additional long-horizon reasoning and planning. Reinforcement learning (RL) fine-tuning can enable such planning in principle, but suffers from drawbacks that hinder scalability. In particular, multi-turn RL training incurs high memory and computational costs, which are exacerbated when training LLMs as policies. Furthermore, the largest LLMs do not expose the APIs necessary to be trained in such a manner. As a result, modern methods to improve the reasoning of LLMs rely on sophisticated prompting mechanisms rather than RL fine-tuning. To remedy this, we propose a novel approach that uses goal-conditioned value functions to guide the reasoning of LLM agents and that scales even to large API-based models. These value functions…
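The core idea in the abstract, using a goal-conditioned value function to guide an agent whose underlying LLM is never fine-tuned, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: `select_action`, `toy_value`, and the word-overlap scoring are hypothetical stand-ins, and in practice the candidates would come from an LLM and the value function would be learned offline.

```python
from typing import Callable, Sequence

# Hypothetical signature: value_fn(state, action, goal) -> estimated chance
# that taking `action` in `state` eventually reaches `goal`.
ValueFn = Callable[[str, str, str], float]


def select_action(state: str, candidates: Sequence[str], goal: str,
                  value_fn: ValueFn) -> str:
    """Return the candidate action with the highest goal-conditioned value.

    The candidates are assumed to be proposed by a frozen LLM; only the
    (much smaller) value function would need training.
    """
    return max(candidates, key=lambda action: value_fn(state, action, goal))


def toy_value(state: str, action: str, goal: str) -> float:
    """Toy stand-in for a learned value function: fraction of goal words
    that also appear in the action. Not a real value estimate."""
    action_words = set(action.lower().split())
    goal_words = set(goal.lower().split())
    if not goal_words:
        return 0.0
    return len(action_words & goal_words) / len(goal_words)


if __name__ == "__main__":
    candidates = [
        "ask about budget",
        "offer a discount on price",
        "say goodbye",
    ]
    best = select_action("negotiation in progress", candidates,
                         "agree on price", toy_value)
    print(best)
```

The design point the sketch tries to capture is that the LLM is only queried for candidates, so no gradient access to the LLM is needed; the scoring step is where any learned, goal-conditioned component would sit.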
