TL;DR
TR-ICRL is a test-time rethinking framework for in-context reinforcement learning. It estimates rewards without ground-truth labels and uses them to refine answers iteratively, targeting both reasoning and knowledge-intensive tasks.
Contribution
The paper proposes a framework that retrieves relevant unlabeled instances for a query, generates candidate answers for each one, and derives pseudo-labels by majority voting, so that the LLM can improve during inference without ground-truth labels.
Findings
TR-ICRL improves Qwen2.5-7B by 21.23% on MedQA.
It achieves a 137.59% improvement on AIME2024.
Extensive ablation studies demonstrate its robustness.
Abstract
In-Context Reinforcement Learning (ICRL) enables Large Language Models (LLMs) to learn online from external rewards directly within the context window. However, a central challenge in ICRL is reward estimation, as models typically lack access to ground-truth labels during inference. To address this limitation, we propose Test-Time Rethinking for In-Context Reinforcement Learning (TR-ICRL), a novel ICRL framework designed for both reasoning and knowledge-intensive tasks. TR-ICRL operates by first retrieving the most relevant instances from an unlabeled evaluation set for a given query. During each ICRL iteration, the LLM generates a set of candidate answers for every retrieved instance. Next, a pseudo-label is derived from this set through majority voting. This label then serves as a proxy for producing reward messages and formative feedback, guiding the LLM through iterative refinement. In the…
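The pseudo-labeling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `majority_vote` and `reward_messages`, the binary reward scheme, and the sample candidate answers are all assumptions made for illustration. The sketch only shows how a majority vote over sampled answers can stand in for a missing ground-truth label.

```python
from collections import Counter


def majority_vote(candidates):
    """Return the most frequent candidate answer and its share of the votes.

    Hypothetical helper: ties go to the answer that appears first,
    because Counter.most_common preserves first-seen order among equal counts.
    """
    if not candidates:
        raise ValueError("need at least one candidate answer")
    counts = Counter(candidates)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(candidates)


def reward_messages(candidates, pseudo_label):
    """Assumed binary reward: 1 if a candidate matches the pseudo-label, else 0."""
    return [1 if c == pseudo_label else 0 for c in candidates]


# Illustrative candidate answers sampled for one retrieved instance.
candidates = ["B", "B", "C", "B", "A"]
label, agreement = majority_vote(candidates)
rewards = reward_messages(candidates, label)
print(label, agreement, rewards)
```

In a full ICRL loop, the rewards and some feedback text would presumably be appended to the context before the next round of generation. The agreement ratio is an extra signal added here for illustration and is not mentioned in the abstract.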
