Agentic Policy Optimization via Instruction-Policy Co-Evolution
Han Zhou, Xingchen Wan, Ivan Vulić, Anna Korhonen

TL;DR
This paper introduces INSPO, a dynamic instruction-policy co-evolution framework that improves reinforcement learning agents by continuously optimizing instructions during training, leading to better reasoning and retrieval performance.
Contribution
The paper proposes a novel co-evolution approach that dynamically updates instructions in RL, enhancing agent reasoning and outperforming static instruction baselines.
Findings
INSPO outperforms static instruction baselines in reasoning tasks.
It discovers innovative instructions that improve strategic reasoning.
It achieves substantial performance gains with minimal computational overhead.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capability of large language models (LLMs), enabling autonomous agents that can conduct effective multi-turn and tool-integrated reasoning. While instructions serve as the primary protocol for defining agents, RLVR typically relies on static and manually designed instructions. However, those instructions may be suboptimal for the base model, and the optimal instruction may change as the agent's policy improves and explores the interaction with the environment. To bridge the gap, we introduce INSPO, a novel Instruction-Policy co-evolution framework that integrates instruction optimization as a dynamic component of the reinforcement learning (RL) loop. INSPO maintains a dynamic population of instruction candidates that are sampled with questions, where reward signals in RL loops are automatically attributed…
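The population mechanism described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract is truncated before it says how rewards are attributed, so the sketch assumes the simplest possible reading, in which each rollout's reward is credited to the instruction that was sampled with the question. The names `Candidate` and `InstructionPopulation`, the uniform sampling, and the running-mean score are all hypothetical choices made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    """One instruction candidate plus the rewards credited to it so far."""
    text: str
    total_reward: float = 0.0
    uses: int = 0

    @property
    def mean_reward(self) -> float:
        # Unused candidates score 0.0 so they can still be ranked.
        return self.total_reward / self.uses if self.uses else 0.0


class InstructionPopulation:
    """Hypothetical container for a population of instruction candidates."""

    def __init__(self, instructions, seed=0):
        self.candidates = [Candidate(text) for text in instructions]
        self.rng = random.Random(seed)

    def sample(self) -> Candidate:
        # Assumption: uniform sampling; the source does not state the scheme.
        return self.rng.choice(self.candidates)

    def credit(self, candidate: Candidate, reward: float) -> None:
        # Assumption: the rollout reward is credited to the sampled instruction.
        candidate.total_reward += reward
        candidate.uses += 1

    def best(self) -> Candidate:
        # Rank candidates by the mean reward credited so far.
        return max(self.candidates, key=lambda c: c.mean_reward)


if __name__ == "__main__":
    pop = InstructionPopulation(["Think step by step.", "Search before answering."])
    for question in ["q1", "q2", "q3", "q4"]:
        cand = pop.sample()
        # Placeholder reward; a real loop would run the policy and verify its answer.
        reward = 1.0 if "Search" in cand.text else 0.0
        pop.credit(cand, reward)
    print(pop.best().text)
```

In a real training loop the placeholder reward would come from the verifiable reward of the policy's rollout, and the population would also be updated over time; neither step is specified in the text available here, so both are left out.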
Taxonomy
Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Topic Modeling
