Provable and Practical In-Context Policy Optimization for Self-Improvement

Tianrun Yu; Yuxiao Yang; Zhaoyang Wang; Kaixiang Zhao; Porter Jenkins; Xuchao Zhang; Chetan Bansal; Huaxiu Yao; Weitong Zhang

arXiv:2603.01335·cs.LG·March 3, 2026

Open Access 3 Reviews

TL;DR

This paper introduces ICPO, a method enabling language models to improve their responses through self-reflection at inference time, backed by theory and practical algorithms that enhance reasoning performance.

Contribution

The paper presents a novel theoretical framework for self-reflection in language models and proposes ME-ICPO, a practical algorithm for in-context policy optimization during inference.

Findings

01

ME-ICPO achieves top-tier performance on mathematical reasoning tasks.

02

Theory shows that a single-layer linear self-attention model, pretrained under a Fisher-weighted logit-matching objective, can imitate a policy-optimization algorithm for linear bandits.

03

ME-ICPO maintains inference efficiency while improving reasoning accuracy.

Abstract

We study test-time scaling, where a model improves its answer through multi-round self-reflection at inference. We introduce In-Context Policy Optimization (ICPO), in which an agent optimizes its response in context using self-assessed or externally observed rewards without modifying its parameters. To explain this ICPO process, we theoretically show that with sufficient pretraining under a novel Fisher-weighted logit-matching objective, a single-layer linear self-attention model can provably imitate a policy-optimization algorithm for linear bandits. Building on this theory, we propose Minimum-Entropy ICPO (ME-ICPO), a practical algorithm that iteratively uses its response and self-assessed reward to refine its response in-context at inference time. By selecting the responses and their rewards with minimum entropy, ME-ICPO ensures the robustness of the self-assessed rewards via majority…
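The abstract is cut off before it finishes describing the selection step, so the following is only a minimal sketch of one plausible reading: score each batch of sampled answers by the entropy of its answer distribution, keep the lowest-entropy batch, and assign self-assessed rewards by agreement with the majority answer. The function names (`answer_entropy`, `majority_vote_rewards`, `select_min_entropy`) and the toy answers are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution over final answers."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def majority_vote_rewards(answers):
    """Assumed self-assessed reward: 1.0 if an answer matches the majority, else 0.0."""
    majority, _ = Counter(answers).most_common(1)[0]
    return majority, [1.0 if a == majority else 0.0 for a in answers]


def select_min_entropy(candidate_batches):
    """Return the batch of sampled answers whose answer distribution has the lowest entropy."""
    return min(candidate_batches, key=answer_entropy)


# Toy data (made up for illustration): two batches of final answers to one problem.
batches = [
    ["12", "15", "12", "9"],   # answers disagree more, so entropy is higher
    ["12", "12", "12", "15"],  # answers mostly agree, so entropy is lower
]
best = select_min_entropy(batches)
majority, rewards = majority_vote_rewards(best)
print(best, majority, rewards)
```

In an in-context loop, the selected responses and their rewards would be appended to the prompt for the next round; how the paper formats that context is not stated in the text above.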

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 4

Strengths

The authors derive provable guarantees showing that a linear self-attention transformer, when trained under a Fisher-weighted objective, can imitate the behavior of a policy optimization algorithm in a linear bandit setting. This is a novel theoretical result. The paper proposes the Minimum-Entropy ICPO (ME-ICPO) algorithm, a practical and implementable version of in-context policy optimization. It integrates entropy-regularized response selection and self-as

Weaknesses

The effectiveness of ME-ICPO depends on choices such as the number of refinement rounds, the sample count per round, and the entropy thresholds. Tuning these hyperparameters is non-trivial and may depend heavily on model size and dataset.

Reviewer 02·Rating 4·Confidence 2

Strengths

1. It is interesting to formulate ICPO as a bandit-style policy optimization approach. The theoretical grounding for in-context self-refinement is potentially impactful if the claims hold in more realistic settings. 2. The framework and algorithm diagrams are well-organized, and the writing is mostly easy to follow.

Weaknesses

1. The theoretical framework in Section 4 uses a linear bandit abstraction and a simplified linear self-attention model, whereas ME-ICPO is demonstrated with models like Qwen2.5-Math-7B. It is not clear how these theoretical assumptions connect to the practical model choices. 2. ICPO requires iterative sampling, which implicitly increases inference compute. The paper only compares with the base model; since this is technically a prompting technique, it is unclear how this improvement differs fr

Reviewer 03·Rating 4·Confidence 2

Strengths

1) **Clear mechanistic link:** a theoretically grounded account connecting pretraining under a Fisher‑weighted objective to in‑context policy‑optimization behavior in an LSA. 2) **Practicality:** ME‑ICPO yields strong math‑reasoning gains with gradient‑free test‑time optimization; **Mean@16 can surpass the base model’s majority‑vote upper bound**, and adding majority vote on ME‑ICPO output brings further gains.

Weaknesses

- **No variability reported in Table 1.** Table 1 reports only point estimates (Accuracy and Mean@16) with no variability across multiple runs; please add mean±std over, e.g., 5 seeds.
- **Theory scope.** Guarantees apply to a **single‑layer LSA** and **linear bandits**; practical models may not be LSA, so the theoretical guarantees do not directly cover standard non-LSA architectures.
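The mean±std reporting requested above can be sketched as follows. This assumes Mean@k is the fraction of correct responses among the k sampled in a run, which the review does not define; the helper names and the per-seed correctness flags are hypothetical placeholders, not results from the paper.

```python
import statistics


def mean_at_k(correct_flags):
    """Fraction of correct responses among the k sampled responses in one run."""
    return sum(correct_flags) / len(correct_flags)


def summarize_over_seeds(per_seed_scores):
    """Return (mean, sample standard deviation) of a metric across seeds."""
    mean = statistics.mean(per_seed_scores)
    std = statistics.stdev(per_seed_scores)  # uses the n - 1 denominator
    return mean, std


# Made-up correctness flags for 5 seeds (k = 4 here to keep the example short).
runs = [
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
]
scores = [mean_at_k(flags) for flags in runs]
mean, std = summarize_over_seeds(scores)
report = f"{mean:.3f} ± {std:.3f}"
print(report)
```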

Taxonomy

Topics: Reinforcement Learning in Robotics · Machine Learning and Algorithms · Advanced Bandit Algorithms Research