Think Outside the Policy: In-Context Steered Policy Optimization

Hsiu-Yuan Huang; Chenming Tang; Weijie Liu; Clive Bai; Saiyong Yang; Yunfang Wu

arXiv:2510.26519 · cs.LG · April 16, 2026

TL;DR

ICPO is a reinforcement learning framework that strengthens the reasoning of Large Reasoning Models (LRMs) by using in-context learning as a source of expert guidance. It aims to improve exploration, training stability, and performance without requiring trajectories from more advanced models.

Contribution

The paper introduces ICPO, a unified Reinforcement Learning from Verifiable Rewards (RLVR) framework that uses in-context learning for expert guidance, expanding exploration and stabilizing training without relying on costly expert models.

Findings

01. ICPO improves reasoning performance on mathematical benchmarks.

02. ICPO enhances training stability and exploration in RLVR.

03. ICPO achieves consistent performance gains without advanced model trajectories.

Abstract

Existing Reinforcement Learning from Verifiable Rewards (RLVR) methods, such as Group Relative Policy Optimization (GRPO), have achieved remarkable progress in improving the reasoning capabilities of Large Reasoning Models (LRMs). However, they exhibit limited exploration because they rely on on-policy rollouts, which are confined to the current policy's distribution, resulting in narrow trajectory diversity. Recent approaches attempt to expand policy coverage by incorporating trajectories generated by stronger expert models, yet this reliance increases computational cost, and such advanced models are often inaccessible. To address these issues, we propose In-Context Steered Policy Optimization (ICPO), a unified framework that leverages the inherent in-context learning capability of LRMs to provide expert guidance using existing datasets. ICPO introduces mixed-policy GRPO with implicit…
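To make the exploration issue concrete, here is a minimal Python sketch of the group-relative advantage commonly associated with GRPO, where each rollout's reward is normalized by the mean and standard deviation of its group. It is not taken from the ICPO repository. The function names, the prompt template, the use of population standard deviation, and the idea of simply appending rewards from in-context-steered rollouts to the on-policy group are all illustrative assumptions, since the abstract above is truncated before it describes the actual mixing scheme.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and std (GRPO-style).

    Assumption: population std plus a small eps; real implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def build_steered_prompt(question, demonstrations):
    """Hypothetical helper: prepend solved examples from an existing dataset.

    `demonstrations` is a list of (question, solution) pairs; the template
    below is made up for illustration.
    """
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in demonstrations)
    return f"{shots}\n\nQ: {question}\nA:"


# Four on-policy rollouts that all fail the verifier: every advantage is zero,
# so this group contributes no learning signal.
on_policy_rewards = [0.0, 0.0, 0.0, 0.0]
on_policy_adv = group_relative_advantages(on_policy_rewards)

# Assumed mixing: one rollout sampled from a few-shot (steered) prompt succeeds
# and is added to the same group, which restores a non-zero signal.
steered_rewards = [1.0]
mixed_adv = group_relative_advantages(on_policy_rewards + steered_rewards)

prompt = build_steered_prompt("What is 7 * 8?", [("What is 2 + 3?", "5")])
```

In this toy case the purely on-policy group yields all-zero advantages, while the mixed group assigns a positive advantage to the steered rollout and negative advantages to the failed ones. This illustrates why widening the set of sampled trajectories can matter; how ICPO actually weights or filters such rollouts is not stated in the text available here.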

Code & Models

Repositories

Celine-hxy/ICPO (GitHub)
