Implicit Turn-Wise Policy Optimization for Proactive User-LLM Interaction

Haoyu Wang; Yuxin Chen; Liang Luo; Buyun Zhang; Ellie Dingqiao Wen; Pan Li

arXiv:2603.23550 · cs.LG · March 26, 2026

Open Access

TL;DR

This paper introduces Implicit Turn-wise Policy Optimization (ITPO), a reinforcement learning method that derives fine-grained, turn-level rewards from sparse outcome signals to improve multi-turn human-AI interaction in collaborative tasks such as math tutoring, document writing, and medical recommendation.

Contribution

ITPO is a novel reinforcement learning approach that uses an implicit process reward model to enhance training stability and effectiveness in multi-turn AI-human collaboration.

Findings

01

ITPO improves convergence across multiple tasks.

02

Turn-wise rewards align with human judgment.

03

A normalization mechanism further enhances training stability.

Abstract

Multi-turn human-AI collaboration is fundamental to deploying interactive services such as adaptive tutoring, conversational recommendation, and professional consultation. However, optimizing these interactions via reinforcement learning is hindered by the sparsity of verifiable intermediate rewards and the high stochasticity of user responses. To address these challenges, we introduce Implicit Turn-wise Policy Optimization (ITPO). ITPO leverages an implicit process reward model to derive fine-grained, turn-wise process rewards from sparse outcome signals. Unlike volatile token-level rewards, these turn-level signals exhibit superior robustness and may utilize a normalization mechanism to further enhance training stability. We evaluate ITPO across three representative multi-turn collaborative tasks: math tutoring, document writing, and medical recommendation. Empirical results…
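The abstract does not spell out how the turn-wise rewards are computed, so the following is only a minimal sketch under stated assumptions. It assumes the implicit process reward takes the log-likelihood-ratio form commonly used for implicit reward models (a scaled difference between policy and reference log-probabilities), summed over the tokens of each assistant turn. The normalization shown, a softmax over turns rescaled so the turn rewards sum to the outcome reward, is a hypothetical stand-in for the paper's mechanism, and all function and parameter names are illustrative.

```python
import math
from typing import List, Sequence


def turn_rewards_from_logratios(
    policy_logps: Sequence[float],
    ref_logps: Sequence[float],
    turn_ids: Sequence[int],
    beta: float = 1.0,
) -> List[float]:
    """Aggregate token-level implicit rewards into one reward per turn.

    Assumption: the token-level implicit reward is
    beta * (log pi(token) - log pi_ref(token)).
    turn_ids[i] is the 0-based index of the turn that token i belongs to.
    """
    if not (len(policy_logps) == len(ref_logps) == len(turn_ids)):
        raise ValueError("all inputs must have the same length")
    n_turns = max(turn_ids) + 1 if turn_ids else 0
    rewards = [0.0] * n_turns
    for lp, lr, t in zip(policy_logps, ref_logps, turn_ids):
        rewards[t] += beta * (lp - lr)
    return rewards


def normalize_to_outcome(turn_rewards: Sequence[float], outcome: float) -> List[float]:
    """Hypothetical normalization (not taken from the paper).

    Softmax-weights the turns and rescales them so the turn rewards
    sum to the sparse outcome reward.
    """
    if not turn_rewards:
        return []
    m = max(turn_rewards)  # subtract the max for numerical stability
    exps = [math.exp(r - m) for r in turn_rewards]
    total = sum(exps)
    return [outcome * e / total for e in exps]


if __name__ == "__main__":
    # Toy dialogue: two assistant turns with two tokens each.
    raw = turn_rewards_from_logratios(
        policy_logps=[-1.0, -2.0, -0.5, -0.5],
        ref_logps=[-1.5, -2.0, -1.0, -1.5],
        turn_ids=[0, 0, 1, 1],
    )
    print(raw)
    print(normalize_to_outcome(raw, outcome=1.0))
```

Aggregating at the turn level averages out per-token noise, which is consistent with the abstract's claim that turn-level signals are more robust than token-level ones. The bounded, outcome-anchored rescaling is one plausible way a normalization step could stabilize training.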

Taxonomy

Topics: Recommender Systems and Techniques · Intelligent Tutoring Systems and Adaptive Learning · Domain Adaptation and Few-Shot Learning