ATPO: Adaptive Tree Policy Optimization for Multi-Turn Medical Dialogue
Ruike Cao, Shaojie Bai, Fugen Yao, Liang Dong, Jian Xu, Li Xiao

TL;DR
This paper introduces ATPO, an uncertainty-aware adaptive tree policy optimization algorithm designed for multi-turn medical dialogue systems, improving diagnosis accuracy by better handling long-horizon decision-making and uncertainty in LLM-based interactions.
Contribution
The paper proposes a novel ATPO algorithm that adaptively allocates rollouts based on uncertainty, with optimizations for computational efficiency, outperforming existing methods in medical dialogue benchmarks.
Findings
ATPO outperforms strong baselines on medical dialogue benchmarks.
Qwen3-8B with ATPO surpasses GPT-4o by 0.92% accuracy.
Uncertainty-guided pruning and asynchronous search improve efficiency.
Abstract
Effective information seeking in multi-turn medical dialogues is critical for accurate diagnosis, especially when dealing with incomplete information. Aligning Large Language Models (LLMs) for these interactive scenarios is challenging due to the uncertainty inherent in user-agent interactions, which we formulate as a Hierarchical Markov Decision Process (H-MDP). While conventional Reinforcement Learning (RL) methods like Group Relative Policy Optimization (GRPO) struggle with long-horizon credit assignment and Proximal Policy Optimization (PPO) suffers from unstable value estimation in this context, we propose a novel uncertainty-aware Adaptive Tree Policy Optimization (ATPO) algorithm. Our method adaptively allocates the rollout budget to states with high uncertainty, quantified by a composite metric of Bellman error and action-value variance. This strategy enables more accurate value…
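The allocation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the one-step Bellman error, the population variance over child action-values, the mixing weight `alpha`, and the proportional largest-remainder allocation are all assumptions made for illustration. The function names `composite_uncertainty` and `allocate_rollouts` are hypothetical.

```python
from statistics import mean, pvariance


def composite_uncertainty(v_state, child_q_values, alpha=0.5):
    """Hypothetical composite score: a weighted sum of Bellman error and Q-variance.

    v_state        -- critic's value estimate for the state
    child_q_values -- action-value estimates for the sampled actions
    alpha          -- assumed mixing weight (not taken from the paper)
    """
    # Assumed one-step Bellman error: gap between V(s) and the mean child Q-value.
    bellman_error = abs(v_state - mean(child_q_values))
    # Spread of the action-value estimates across the sampled actions.
    q_variance = pvariance(child_q_values)
    return alpha * bellman_error + (1.0 - alpha) * q_variance


def allocate_rollouts(uncertainties, budget, min_per_state=1):
    """Split an integer rollout budget across states in proportion to uncertainty."""
    n = len(uncertainties)
    if budget < n * min_per_state:
        raise ValueError("budget too small for the per-state minimum")
    remaining = budget - n * min_per_state
    total = sum(uncertainties)
    if total <= 0:
        # No uncertainty signal at all: fall back to an even split.
        shares = [remaining / n] * n
    else:
        shares = [remaining * u / total for u in uncertainties]
    extra = [int(s) for s in shares]
    # Largest-remainder rounding so the allocation sums exactly to the budget.
    leftover = remaining - sum(extra)
    order = sorted(range(n), key=lambda i: shares[i] - extra[i], reverse=True)
    for i in order[:leftover]:
        extra[i] += 1
    return [min_per_state + e for e in extra]


# A confident state (critic agrees with its children) versus an uncertain one.
scores = [
    composite_uncertainty(0.5, [0.5, 0.5]),
    composite_uncertainty(0.2, [1.0, 0.0]),
]
allocation = allocate_rollouts(scores, budget=6)
```

In this toy example the first state has zero uncertainty and receives only the per-state minimum, while the rest of the budget goes to the second state. A real system would take the value and action-value estimates from the trained critic rather than from hand-written numbers.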
Peer Reviews
Decision: ICLR 2026 Poster
The paper makes a good extension to recent works on using TreePO for training RL models for reasoning tasks and conversation forests on the task of answering MCQs in the medical domain. This is an important task in the important domain of medical diagnosis, which may specifically prefer small, in-house models over proprietary ones such as GPT. The paper is well-written, building up the intuition for the proposed metric as well as developing the equations for actor/policy and critic/value updates. Th
Not weaknesses specifically, but some areas for improvement where details are lacking or there are clarity issues are listed below. - Contributions (line 84): This aspect is not really highlighted much in the paper and as such the setup seems identical to that described in the TreePO paper (Li et al., 2025b, cited by the authors). If different, please highlight the differences and why this forms a core contribution. - In general, the performances in Table-1 seem very close to TreePO in many ins
(1) This paper has some theoretical derivation for the framework.
However, I find some parts of the theoretical derivation questionable. 1) When defining the Bellman error in equations (1) and (2), the authors use a one-step lookahead. This is questionable since with one step, there is no long-term reward (i.e., long-term exploration). In this case, equations (1) and (2) would collapse to the average reward of all the states. If the authors finally generate an answer based on the whole dialogue, it makes more sense to enable the Bellman error to look multip
- The idea of expanding nodes and allocating more rollout budget to states with high uncertainty is sound. I believe the overall design (with some abstractions) could be applicable to other dialogue tasks beyond medical benchmarks. - The authors provided evaluation on three different medical benchmarks, showing improvement of ATPO compared to other training methods such as GRPO and TreePO. - The authors also provided interesting analysis of performing tree-based training during RL. Specifically,
Despite the overall positive experimental results, I believe there are some uncertainties or flaws in the experimental setup that could substantially undermine the results and comparisons made. If these are addressed I am willing to increase my soundness and overall score. I detail them below. 1. Tree-based methods generally require much more compute than methods such as PPO/GRPO, as they need to perform multiple policy and value inferences per state. However, neither Table 1 nor Figure 2 rep
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Multimodal Machine Learning Applications
