Aligning LLMs Toward Multi-Turn Conversational Outcomes Using Iterative PPO

Daniel R. Jiang; Jalaj Bhandari; Yukai Yang; Rémi Munos; Tyler Lu

arXiv:2511.21638 · cs.LG · November 27, 2025

TL;DR

This paper introduces Iterative PPO, a method for training large language models in multi-turn conversations by reducing the problem to single-turn RLHF-style tasks, enabling stable and effective policy improvement.

Contribution

The paper formalizes a reduction from the multi-turn RL problem to a sequence of single-turn RL problems and proposes Iterative PPO, which reuses existing single-turn RLHF tooling to train multi-turn conversational agents.

Findings

01. Iterative PPO effectively improves multi-turn conversational outcomes.
02. The method leverages off-the-shelf RLHF tools for stability.
03. It balances online adaptability with offline training stability.

Abstract

Optimizing large language models (LLMs) for multi-turn conversational outcomes remains a significant challenge, especially in goal-oriented settings like AI marketing or sales agents who facilitate transactions via messaging platforms. The difficulty stems from sparse, long-horizon rewards and the discrepancy between response-level planning and token-level generation. In this technical note, we propose a formal reduction of the multi-turn RL problem into a sequence of single-turn RLHF-style problems. This is achieved by setting a learned multi-turn Q-function as the reward model for the single-turn problem. We demonstrate and prove a key insight: solving this single-turn RL problem with standard token-level PPO is equivalent to a policy improvement step within the multi-turn problem. This insight naturally leads to Iterative PPO, a batch online policy iteration algorithm that alternates…
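
The policy improvement step described in the abstract can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a single conversation state with three enumerable candidate responses, a hypothetical learned Q-value for each, and a KL-regularized single-turn objective of the kind commonly used in RLHF. Instead of running token-level PPO, it uses the well-known closed-form maximizer of that objective, pi_new(a) proportional to pi_ref(a) * exp(Q(a) / beta), as a stand-in. All names and numbers (`pi_ref`, `q_values`, `beta`) are illustrative assumptions.

```python
import math


def kl_regularized_improvement(pi_ref, q_values, beta):
    """Exact maximizer of E_pi[Q] - beta * KL(pi || pi_ref) over a finite
    set of responses: pi_new(a) is proportional to pi_ref(a) * exp(Q(a) / beta).

    This closed form stands in for the PPO optimizer in this toy setting.
    """
    weights = [p * math.exp(q / beta) for p, q in zip(pi_ref, q_values)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_value(policy, q_values):
    """Expected Q-value of a response sampled from `policy`."""
    return sum(p * q for p, q in zip(policy, q_values))


# Hypothetical conversation state with three candidate responses.
pi_ref = [0.5, 0.3, 0.2]    # current (reference) policy over responses
q_values = [0.1, 0.4, 0.9]  # assumed learned multi-turn Q(s, a) estimates
beta = 0.5                  # assumed KL-penalty strength

pi_new = kl_regularized_improvement(pi_ref, q_values, beta)

value_ref = expected_value(pi_ref, q_values)
value_new = expected_value(pi_new, q_values)
print(f"reference policy value: {value_ref:.3f}")
print(f"improved policy value:  {value_new:.3f}")
```

In this toy setting the improved policy shifts probability toward responses with higher Q-values while staying close to the reference policy; a smaller `beta` permits a larger shift. In an actual LLM setting the response space cannot be enumerated, which is where an optimizer such as token-level PPO would take the place of the closed form.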

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Natural Language Processing Techniques