Provable and Practical In-Context Policy Optimization for Self-Improvement
Tianrun Yu, Yuxiao Yang, Zhaoyang Wang, Kaixiang Zhao, Porter Jenkins, Xuchao Zhang, Chetan Bansal, Huaxiu Yao, Weitong Zhang

TL;DR
This paper introduces ICPO, a method enabling language models to improve their responses through self-reflection at inference time, backed by theory and practical algorithms that enhance reasoning performance.
Contribution
The paper presents a novel theoretical framework for self-reflection in language models and proposes ME-ICPO, a practical algorithm for in-context policy optimization during inference.
Findings
ME-ICPO achieves top-tier performance on mathematical reasoning tasks.
A theoretical proof shows that a single-layer linear self-attention model can imitate a policy-optimization algorithm for linear bandits.
ME-ICPO maintains inference efficiency while improving reasoning accuracy.
Abstract
We study test-time scaling, where a model improves its answer through multi-round self-reflection at inference. We introduce In-Context Policy Optimization (ICPO), in which an agent optimizes its response in context using self-assessed or externally observed rewards without modifying its parameters. To explain this ICPO process, we theoretically show that with sufficient pretraining under a novel Fisher-weighted logit-matching objective, a single-layer linear self-attention model can provably imitate a policy-optimization algorithm for linear bandits. Building on this theory, we propose Minimum-Entropy ICPO (ME-ICPO), a practical algorithm that iteratively uses its response and self-assessed reward to refine its response in-context at inference time. By selecting the responses and their rewards with minimum entropy, ME-ICPO ensures the robustness of the self-assessed rewards via majority…
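The selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that each candidate response is paired with a list of final answers sampled after that response is placed in context, that the self-assessed reward is agreement with the majority answer, and that "minimum entropy" means the lowest Shannon entropy of the empirical answer distribution. The function names `answer_entropy`, `majority_reward`, and `select_min_entropy` are hypothetical.

```python
from collections import Counter
from math import log


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution over answers."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * log(c / n) for c in counts.values())


def majority_reward(answer, answers):
    """Assumed self-assessed reward: 1.0 if `answer` equals the majority answer."""
    majority, _ = Counter(answers).most_common(1)[0]
    return 1.0 if answer == majority else 0.0


def select_min_entropy(candidates):
    """Return the candidate whose follow-up answers have the lowest entropy.

    `candidates` is a list of (response_answer, follow_up_answers) pairs;
    this pairing is an assumption made for illustration.
    """
    return min(candidates, key=lambda c: answer_entropy(c[1]))


# Hypothetical data: two candidate responses with their follow-up answers.
candidates = [
    ("42", ["42", "42", "42", "17"]),  # answers mostly agree: low entropy
    ("17", ["42", "17", "9", "3"]),    # answers all differ: high entropy
]
best = select_min_entropy(candidates)
```

Under these assumptions, a candidate is preferred when the answers sampled after it concentrate on one value, which is the sense in which a low-entropy choice makes a majority-vote reward more stable.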
Peer Reviews
Decision · ICLR 2026 Poster
The authors derive provable guarantees showing that a linear self-attention transformer, when trained under a Fisher-weighted objective, can imitate the behavior of a policy optimization algorithm in a linear bandit setting. This is a novel theoretical result. The paper also proposes the Minimum-Entropy ICPO (ME-ICPO) algorithm, a practical and implementable version of in-context policy optimization. It integrates entropy-regularized response selection and self-as
The effectiveness of ME-ICPO depends on choices such as the number of refinement rounds, the sample count per round, and entropy thresholds. Tuning these hyperparameters is non-trivial and may depend heavily on model size and dataset.
1. It is interesting to formulate ICPO as a bandit-style policy optimization approach. The theoretical grounding for in-context self-refinement is potentially impactful if the claims hold in more realistic settings. 2. The framework and algorithm diagrams are well-organized, and the writing is mostly easy to follow.
1. The theoretical framework in Section 4 uses a linear bandit abstraction and a simplified linear self-attention model, whereas ME-ICPO is demonstrated with models like Qwen2.5-Math-7B. It is not clear how these theoretical assumptions connect to the practical model choices. 2. ICPO requires iterative sampling, which implicitly increases inference compute. The paper only compares with the base model; since this is technically a prompting technique, it is unclear how this improvement differs fr
1) **Clear mechanistic link:** a theoretically grounded account connecting pretraining under a Fisher‑weighted objective to in‑context policy‑optimization behavior in an LSA. 2) **Practicality:** ME‑ICPO yields strong math‑reasoning gains with gradient‑free test‑time optimization; **Mean@16 can surpass the base model’s majority‑vote upper bound**, and adding majority vote on ME‑ICPO output brings further gains.
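To make the two metrics in this comparison concrete, here is a small sketch. It assumes that Mean@16 is the average per-sample accuracy over 16 sampled responses and that majority vote scores only the single most frequent answer; these definitions are assumptions, since the review excerpt does not spell them out. The names `mean_at_k` and `majority_vote_correct` are hypothetical, and the example uses 4 samples rather than 16 for brevity.

```python
from collections import Counter


def mean_at_k(samples, gold):
    """Fraction of sampled answers that match the gold answer."""
    return sum(a == gold for a in samples) / len(samples)


def majority_vote_correct(samples, gold):
    """True if the most frequent sampled answer matches the gold answer."""
    top, _ = Counter(samples).most_common(1)[0]
    return top == gold


# Hypothetical sampled answers for one problem whose gold answer is "5".
samples = ["5", "5", "7", "5"]
mean_score = mean_at_k(samples, "5")
vote_ok = majority_vote_correct(samples, "5")
```

On a single problem, majority vote is correct whenever the gold answer is the most frequent one, so the per-sample mean cannot exceed it there; the reviewer's point concerns the refined model's Mean@16 compared against the base model's majority-vote score.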
- **No variability reported in Table 1.** Table 1 reports only point estimates (Accuracy and Mean@16) with no variability across multiple runs; please add mean±std over, e.g., 5 seeds. - **Theory scope.** Guarantees apply to a **single‑layer LSA** and **linear bandits**; practical models may not be LSA, so the theoretical guarantees do not directly cover standard non-LSA architectures.
Taxonomy
Topics: Reinforcement Learning in Robotics · Machine Learning and Algorithms · Advanced Bandit Algorithms Research
