TL;DR
This paper reveals that large language models can perform in-context reinforcement learning during inference, improving their responses through a simple multi-round prompting framework that uses scalar rewards for self-improvement.
Contribution
The authors introduce ICRL prompting, a novel inference-time self-improvement method enabling LLMs to optimize responses via scalar rewards, demonstrating emergent reinforcement learning behavior.
Findings
Response quality improves across rounds as the context of prior responses and their rewards grows during ICRL prompting.
ICRL outperforms baselines like Self-Refine and Reflexion.
Even self-generated rewards enhance LLM performance.
Abstract
Reinforcement learning (RL) is a framework for solving sequential decision-making problems. In this work, we demonstrate that, surprisingly, RL emerges during the inference time of large language models (LLMs), a phenomenon we term in-context RL (ICRL). To reveal this capability, we introduce a simple multi-round prompting framework, which we call ICRL prompting, for inference-time self-improvement. The goal of ICRL prompting is to guide LLMs to perform reinforcement learning during inference for self-improvement on a given task. After each response, the model receives numerical scalar feedback, denoted as a reward. In the next round, we prompt the LLM again together with a context that concatenates all prior responses and their associated rewards. We consistently observe that response quality improves as the context grows. In other words, the LLM can optimize scalar reward signals during…
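The multi-round loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`build_prompt`, `icrl_prompting`, `toy_llm`, `toy_reward`), the prompt layout, and the number-guessing task are all assumptions made for the example, and `toy_llm` is a deterministic stand-in for a real model call.

```python
from typing import Callable, List, Tuple

# One (response, reward) pair per completed round.
History = List[Tuple[str, float]]


def build_prompt(task: str, history: History) -> str:
    """Concatenate the task with all prior responses and their scalar rewards.

    The layout is a hypothetical choice; the paper's template may differ.
    """
    lines = [f"Task: {task}"]
    for i, (response, reward) in enumerate(history, start=1):
        lines.append(f"Attempt {i}: {response}")
        lines.append(f"Reward: {reward}")
    lines.append("Next attempt:")
    return "\n".join(lines)


def icrl_prompting(
    task: str,
    llm: Callable[[str], str],
    reward_fn: Callable[[str], float],
    rounds: int,
) -> History:
    """Run the multi-round loop: respond, score, append to context, repeat."""
    history: History = []
    for _ in range(rounds):
        prompt = build_prompt(task, history)
        response = llm(prompt)
        reward = reward_fn(response)
        history.append((response, reward))
    return history


def toy_llm(prompt: str) -> str:
    """Stand-in for a model: reads past attempts and rewards from the prompt
    and proposes one more than the best-rewarded attempt so far."""
    attempts, rewards = [], []
    for line in prompt.splitlines():
        if line.startswith("Attempt "):
            attempts.append(int(line.split(": ", 1)[1]))
        elif line.startswith("Reward: "):
            rewards.append(float(line.split(": ", 1)[1]))
    if not attempts:
        return "0"
    best_attempt = max(zip(rewards, attempts))[1]
    return str(best_attempt + 1)


def toy_reward(response: str) -> float:
    """Hypothetical scalar reward: negative distance to a hidden target of 3."""
    return float(-abs(3 - int(response)))


if __name__ == "__main__":
    for response, reward in icrl_prompting(
        "guess the hidden number", toy_llm, toy_reward, rounds=4
    ):
        print(response, reward)
```

In practice `llm` would wrap a call to an actual model and `reward_fn` would be a task-specific scorer (or, per the findings above, a reward produced by the model itself). The only state carried between rounds is the growing context of responses and rewards; no model weights are updated.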