Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs
Fei Wei, Daoyuan Chen, Ce Wang, Yilun Huang, Yushuo Chen, Xuchen Pan, Yaliang Li, Bolin Ding

TL;DR
This paper introduces Learn-to-Ask, a novel offline learning framework that enables large language models to become proactive, goal-oriented dialogue agents without relying on user simulators, demonstrated in real-world medical applications.
Contribution
The paper presents a simulator-free offline policy learning method for proactive LLMs, leveraging observed expert trajectories and a reward calibration pipeline to deploy effective dialogue agents.
Findings
Achieved superior performance to human experts in medical dialogue tasks.
Successfully deployed proactive LLMs in large-scale online AI services.
Demonstrated effectiveness across models up to 32B parameters.
Abstract
Large Language Models (LLMs) excel as passive responders, but teaching them to be proactive, goal-oriented partners, a critical capability in high-stakes domains, remains a major challenge. Current paradigms either myopically optimize single-turn attributes or rely on brittle, high-cost user simulators, creating a persistent "reality gap". To bridge this gap, we introduce Learn-to-Ask, a general, simulator-free framework for learning and deploying proactive dialogue agents directly from offline expert data, bypassing the need to model complex user dynamics. Our key insight is to reframe the offline policy learning problem by leveraging the observed future of each expert trajectory. This allows us to infer a dense, turn-by-turn reward signal grounded in the expert's revealed strategy, decomposing the intractable long-horizon problem into a series of…
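The decomposition described in the abstract can be illustrated with a minimal sketch. It assumes a logged dialogue is a list of (role, text) pairs and that each expert turn yields one training sample consisting of the context so far, the expert's actual utterance, and the remainder of the log as the "observed future". The names `TurnSample` and `decompose` are hypothetical and do not come from the paper; how the paper actually turns the observed future into a reward is not shown here.

```python
from dataclasses import dataclass
from typing import List, Tuple

Turn = Tuple[str, str]  # (role, text); roles are assumed to be "user" or "expert"


@dataclass
class TurnSample:
    context: List[Turn]          # everything said before the expert's turn
    expert_action: str           # what the expert actually said at this turn
    observed_future: List[Turn]  # the expert's turn plus everything logged after it


def decompose(dialogue: List[Turn]) -> List[TurnSample]:
    """Split one logged dialogue into one single-turn sample per expert turn.

    Assumption: the observed future starts at the expert's own turn, so a
    downstream reward function can compare a model's proposed turn against
    what the expert went on to ask.
    """
    samples = []
    for i, (role, text) in enumerate(dialogue):
        if role != "expert":
            continue
        samples.append(
            TurnSample(
                context=dialogue[:i],
                expert_action=text,
                observed_future=dialogue[i:],
            )
        )
    return samples


dialogue = [
    ("user", "I have a headache."),
    ("expert", "How long has it lasted?"),
    ("user", "Two days."),
    ("expert", "Any fever?"),
    ("user", "No."),
]
samples = decompose(dialogue)
```

Each sample can then be scored independently, which is what makes the long-horizon problem tractable as a set of single-turn problems in this sketch.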
Peer Reviews
Decision · Submitted to ICLR 2026
This paper seeks to conduct "reward-mining" from unsupervised offline logs, which is an important and cutting-edge topic. The intuition of reward design regarding when to stop is pretty straightforward, and the one regarding the observed future is correlated with topics in unsupervised RL like experience-based learning or minimizing surprise. Overall I appreciate the topic, and find the methodology design interesting. The in-house benchmark construction and validation on real-world scenarios
I do find some weaknesses and questions that I wish the authors could address: - Why is the reward designed in a multiplicative way? According to Table 1, a simple summation of the two rewards is used as an ablation setting. It seems that the performance of the reward summation is comparable to or even better (under the 32B setting) than that of the multiplicative reward. Therefore I'm especially curious about the motivation of the multiplicative reward design. - According to the ablated results in Tab
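To make the reviewer's question concrete, the sketch below contrasts the two ways of combining reward components. It assumes two per-turn scores in [0, 1]: `r_ask` for question quality inferred from the observed future, and `r_stop` for the correctness of the stopping decision. These names, ranges, and function names are hypothetical; the paper's exact reward definitions are not given in this excerpt.

```python
def multiplicative_reward(r_ask: float, r_stop: float) -> float:
    # A product acts as a gate: if either component is 0, the total is 0.
    return r_ask * r_stop


def additive_reward(r_ask: float, r_stop: float) -> float:
    # A plain sum lets a high score on one component offset a zero on the other.
    return r_ask + r_stop


# A turn with a good question but a wrong stopping decision (assumed scores).
gated = multiplicative_reward(1.0, 0.0)
summed = additive_reward(1.0, 0.0)
```

Under these assumptions, the multiplicative form gives no credit to a turn that fails either criterion, while the sum still rewards it partially. Whether that difference matters in practice is exactly what the ablation in Table 1 is meant to test.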
1. Designed a simulator-free offline training framework, which helps narrow the gap between simulated environments and real-world scenarios. 2. Introduced an elegant and well-grounded reward design that leverages the “future segments” of expert dialogue trajectories to infer dense rewards, effectively transforming sparse dialogue feedback into continuous supervision signals and reducing reliance on manual annotations. 3. Demonstrated superior empirical performance compared to baseline methods
1. The paper exhibits a somewhat marketing-oriented presentation style, introducing a number of new terms and concepts that are not strictly necessary. This makes the exposition less focused, and the core innovations and logical thread of the paper are not presented in a sufficiently clear or linear manner. 2. The claimed “framework innovation” primarily represents a system-level integration and engineering realization of existing methods in offline reinforcement learning and reward relabeling,
The problem being studied is well-motivated and significant (albeit not very original), as proactive LLMs are important tools in domains like healthcare. The authors’ approach performs well empirically on their healthcare testbed. The application to a real-world, large-scale medical AI system is also interesting.
The main weakness is in the novelty of the authors’ approach. The basic idea (transform the problem into a sequence of single-turn interactions, assign rewards to each, and fine-tune based on those rewards) is fairly boilerplate. There is also some amount of manual reward assignment, which may require changing in order to hold up in other domains beyond healthcare. In aggregate, this paper feels more like a nice engineering application of largely existing techniques, as opposed to a fundamentall
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Multimodal Machine Learning Applications
