Reward Is Enough: LLMs Are In-Context Reinforcement Learners

Kefan Song; Amir Moeini; Peng Wang; Lei Gong; Rohan Chandra; Shangtong Zhang; Yanjun Qi

arXiv:2506.06303 · cs.LG · April 28, 2026

TL;DR

This paper reveals that large language models can perform in-context reinforcement learning during inference, improving their responses through a simple multi-round prompting framework that uses scalar rewards for self-improvement.

Contribution

The authors introduce ICRL prompting, a novel inference-time self-improvement method enabling LLMs to optimize responses via scalar rewards, demonstrating emergent reinforcement learning behavior.

Findings

01. Response quality improves with more context during ICRL prompting.

02. ICRL outperforms baselines like Self-Refine and Reflexion.

03. Even self-generated rewards enhance LLM performance.

Abstract

Reinforcement learning (RL) is a framework for solving sequential decision-making problems. In this work, we demonstrate that, surprisingly, RL emerges during the inference time of large language models (LLMs), a phenomenon we term in-context RL (ICRL). To reveal this capability, we introduce a simple multi-round prompting framework, which we call ICRL prompting, for inference-time self-improvement. The goal of ICRL prompting is to guide LLMs to perform reinforcement learning during inference for self-improvement on a given task. After each response, the model receives numerical scalar feedback, denoted as a reward. In the next round, we prompt the LLM again together with a context that concatenates all prior responses and their associated rewards. We consistently observe that response quality improves as the context grows. In other words, the LLM can optimize scalar reward signals during…
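The multi-round loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the function names `build_prompt` and `icrl_prompting`, and the `llm` and `reward_fn` callables are all assumptions introduced here. The `toy_llm` and `toy_reward` stand-ins exist only to make the loop runnable; the stub is scripted and says nothing about how a real LLM behaves.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]  # (response, scalar reward) pairs


def build_prompt(task: str, history: History) -> str:
    """Concatenate the task with all prior responses and their rewards.

    The exact wording is an assumption; the abstract only says the context
    concatenates prior responses and their associated rewards.
    """
    lines = [f"Task: {task}"]
    for i, (response, reward) in enumerate(history, start=1):
        lines.append(f"Attempt {i}: {response}")
        lines.append(f"Reward: {reward}")
    lines.append("Give a new response that earns a higher reward.")
    return "\n".join(lines)


def icrl_prompting(
    task: str,
    llm: Callable[[str], str],          # hypothetical: prompt -> response
    reward_fn: Callable[[str], float],  # hypothetical: response -> scalar reward
    rounds: int,
) -> History:
    """Run `rounds` rounds, growing the context after every response."""
    history: History = []
    for _ in range(rounds):
        response = llm(build_prompt(task, history))
        history.append((response, reward_fn(response)))
    return history


# Toy stand-ins so the sketch runs without a model (NOT a real LLM).
def toy_llm(prompt: str) -> str:
    # Scripted stub: answers with the number of attempts seen so far.
    return str(prompt.count("\nAttempt "))


def toy_reward(response: str) -> float:
    # Hypothetical task: the reward is highest (0.0) when the guess equals 3.
    return float(-abs(int(response) - 3))


if __name__ == "__main__":
    for response, reward in icrl_prompting(
        "Guess the hidden number", toy_llm, toy_reward, rounds=4
    ):
        print(response, reward)
```

In practice `llm` would wrap a call to an actual model and `reward_fn` would wrap a task-specific scorer or, per the findings above, a reward produced by the model itself.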

Videos

Reward Is Enough: LLMs Are In-Context Reinforcement Learners · SlidesLive