ATPO: Adaptive Tree Policy Optimization for Multi-Turn Medical Dialogue

Ruike Cao; Shaojie Bai; Fugen Yao; Liang Dong; Jian Xu; Li Xiao

arXiv:2603.02216·cs.LG·March 4, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces ATPO, an uncertainty-aware adaptive tree policy optimization algorithm designed for multi-turn medical dialogue systems, improving diagnosis accuracy by better handling long-horizon decision-making and uncertainty in LLM-based interactions.

Contribution

The paper proposes ATPO, an algorithm that adaptively allocates rollouts based on uncertainty and adds optimizations for computational efficiency; it outperforms existing methods on medical dialogue benchmarks.

Findings

01

ATPO outperforms strong baselines on medical dialogue benchmarks.

02

Qwen3-8B with ATPO surpasses GPT-4o by 0.92% accuracy.

03

Uncertainty-guided pruning and asynchronous search improve efficiency.

Abstract

Effective information seeking in multi-turn medical dialogues is critical for accurate diagnosis, especially when dealing with incomplete information. Aligning Large Language Models (LLMs) for these interactive scenarios is challenging due to the uncertainty inherent in user-agent interactions, which we formulate as a Hierarchical Markov Decision Process (H-MDP). While conventional Reinforcement Learning (RL) methods like Group Relative Policy Optimization (GRPO) struggle with long-horizon credit assignment and Proximal Policy Optimization (PPO) suffers from unstable value estimation in this context, we propose a novel uncertainty-aware Adaptive Tree Policy Optimization (ATPO) algorithm. Our method adaptively allocates the rollout budget to states with high uncertainty, quantified by a composite metric of Bellman error and action-value variance. This strategy enables more accurate value…
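The allocation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only says the uncertainty metric combines Bellman error and action-value variance, so the weighted sum, the `alpha` weight, the one-step target, the proportional split, and the function names `node_uncertainty` and `allocate_rollouts` are all assumptions made for illustration.

```python
from statistics import pvariance


def node_uncertainty(value, reward, child_values, gamma=1.0, alpha=0.5):
    """Composite uncertainty of one tree node (hypothetical form).

    Assumption: a weighted sum of the absolute one-step Bellman error
    and the variance of the child value estimates.
    """
    # One-step Bellman target: reward plus discounted mean child value.
    target = reward + gamma * sum(child_values) / len(child_values)
    bellman_error = abs(target - value)
    # Spread of child estimates stands in for action-value variance.
    q_variance = pvariance(child_values)
    return alpha * bellman_error + (1 - alpha) * q_variance


def allocate_rollouts(uncertainties, budget):
    """Split an integer rollout budget across nodes in proportion to uncertainty."""
    total = sum(uncertainties)
    if total == 0:
        # No signal: fall back to an even split.
        shares = [budget / len(uncertainties)] * len(uncertainties)
    else:
        shares = [budget * u / total for u in uncertainties]
    counts = [int(s) for s in shares]
    # Hand out the remaining rollouts by largest fractional remainder.
    leftover = budget - sum(counts)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Two nodes: the second is three times as uncertain as the first.
print(allocate_rollouts([1.0, 3.0], budget=8))
```

Under these assumptions, a node whose value estimate disagrees with its children, or whose children disagree with each other, receives a larger share of the fixed budget, while well-estimated nodes receive fewer rollouts.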

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

The paper makes a good extension to recent work on using TreePO for training RL models for reasoning tasks and conversation forests, applied to the task of answering MCQs in the medical domain. This is an important task in the important domain of medical diagnosis, which may specifically prefer small, in-house models over proprietary ones such as GPT. The paper is well-written, building up the intuition for the proposed metric as well as developing the equations for actor/policy and critic/value updates. Th

Weaknesses

Not weaknesses specifically, but some areas for improvement where details are lacking or there are clarity issues are listed below:
- Contributions (line 84): This aspect is not really highlighted much in the paper, and as such the setup seems identical to that described in the TreePO paper (Li et al. 2025b, cited by the authors). If different, please highlight the differences and why this forms a core contribution.
- In general, the performances in Table 1 seem very close to TreePO in many ins

Reviewer 02 · Rating 2 · Confidence 4

Strengths

(1) This paper has some theoretical derivation for the framework.

Weaknesses

(1) However, I find some parts of the theoretical derivation questionable. 1) When defining the Bellman error in equations (1) and (2), the authors use a one-step lookahead. This is questionable since with one step, there is no long-term reward (i.e., long-term exploration). In this case, equations (1) and (2) would collapse to the average reward of all the states. If the authors finally generate an answer based on the whole dialogue, it makes more sense to enable the Bellman error to look multip

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The notion to expand nodes and allocate more rollout budget to states with high uncertainty is sound. I believe the overall design (with some abstractions) could be applicable to other dialogue tasks beyond medical benchmarks.
- The authors provided evaluation on three different medical benchmarks, showing improvement of ATPO compared to other training methods such as GRPO and TreePO.
- The authors also provided interesting analysis of performing tree-based training during RL. Specifically,

Weaknesses

Despite the overall positive experimental results, I believe there are some uncertainties or flaws in the experimental setup that could substantially undermine the results and comparisons made. If these are addressed, I am willing to increase my soundness and overall score. I detail them below.
1. Tree-based methods generally require much more compute than methods such as PPO/GRPO, as they need to perform multiple policy and value inferences per state. However, neither Table 1 nor Figure 2 rep

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Multimodal Machine Learning Applications