Offline Policy Evaluation of Multi-Turn LLM Health Coaching with Real Users

Melik Ozolcer; Sang Won Bae

arXiv:2510.17173·cs.AI·October 22, 2025


Open Access

TL;DR

This paper evaluates a multi-turn LLM health coach using offline policy evaluation with real users, revealing subgroup-specific effects and proposing an evaluation-first approach for personalization.

Contribution

It applies subgroup-aware offline policy evaluation to a deployed, tool-augmented LLM health coach and, in a seven-user pilot, shows that per-subgroup reporting surfaces harms that averages hide and can guide personalization.

Findings

01

A uniform heavy-tool policy raises average value on the logged data but harms specific subgroups, most notably low-health-literacy/high-self-efficacy users.

02

In a lightweight simulator with hidden archetypes, a small early information-gain bonus shortens trait identification and improves goal success and pass@3.

03

Evaluating before personalizing, and always reporting per-archetype metrics, surfaces subgroup harms that averages obscure and guides personalization.
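
One way to picture the early information-gain bonus from the second finding is as an expected entropy reduction over the hidden archetypes, added to an action's score during the first few turns. The sketch below is an illustration under that assumption, not the paper's implementation: the function names, the `beta` weight, and the `early_turns` cutoff are all hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def expected_info_gain(prior, likelihoods):
    """Expected entropy reduction over archetypes for one candidate action.

    prior[k]          = current belief that the user is archetype k
    likelihoods[o][k] = P(observation o | archetype k) if the action is taken
    """
    h_prior = entropy(prior)
    gain = 0.0
    for lik in likelihoods:
        p_obs = sum(l * p for l, p in zip(lik, prior))
        if p_obs == 0:
            continue  # this observation cannot occur under the current belief
        posterior = [l * p / p_obs for l, p in zip(lik, prior)]
        gain += p_obs * (h_prior - entropy(posterior))
    return gain

def action_score(base_value, prior, likelihoods, turn, beta=0.1, early_turns=3):
    """Base value plus a small bonus that applies only in early turns (assumed form)."""
    bonus = beta * expected_info_gain(prior, likelihoods) if turn < early_turns else 0.0
    return base_value + bonus

# Two equally likely archetypes; one action separates them, the other does not.
prior = [0.5, 0.5]
probing = [[1.0, 0.0], [0.0, 1.0]]   # each observation identifies one archetype
neutral = [[0.5, 0.5], [0.5, 0.5]]   # observations carry no information

ig_probing = expected_info_gain(prior, probing)
ig_neutral = expected_info_gain(prior, neutral)
early = action_score(1.0, prior, probing, turn=0)
late = action_score(1.0, prior, probing, turn=5)
```

With a uniform prior over two archetypes, the probing action earns the maximum gain of ln 2 and the neutral action earns none, so the bonus tilts early turns toward questions that reveal the user's traits and disappears once `turn` reaches the cutoff.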

Abstract

We study a web-deployed, tool-augmented LLM health coach with real users. In a pilot with seven users (280 rated turns), offline policy evaluation (OPE) over factorized decision heads (Tool/Style) shows that a uniform heavy-tool policy raises average value on logs but harms specific subgroups, most notably low-health-literacy/high-self-efficacy users. A lightweight simulator with hidden archetypes further shows that adding a small early information-gain bonus reliably shortens trait identification and improves goal success and pass@3. Together, these early findings indicate an evaluation-first path to personalization: freeze the generator, learn subgroup-aware decision heads on typed rewards (objective tool outcomes and satisfaction), and always report per-archetype metrics to surface subgroup harms that averages obscure.
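
The abstract does not say which OPE estimator is used, so the sketch below assumes a self-normalized inverse propensity scoring (IPS) estimate to show why per-subgroup reporting matters. The log schema (`subgroup`, `tool`, `style`, `p_tool`, `p_style`, `reward`), the function names, and every number are invented for illustration and are not the paper's data or results. The importance weight is the product of the Tool-head and Style-head ratios, mirroring the factorized decision heads.

```python
from collections import defaultdict

def snips_by_subgroup(logs, tool_head, style_head):
    """Self-normalized IPS value estimate per subgroup and overall.

    tool_head(context, tool) and style_head(context, style) return the
    target policy's probability for each factorized decision.
    """
    num = defaultdict(float)
    den = defaultdict(float)
    for row in logs:
        # Factorized importance weight: product of the per-head ratios.
        w = (tool_head(row["context"], row["tool"]) / row["p_tool"]) * \
            (style_head(row["context"], row["style"]) / row["p_style"])
        for key in (row["subgroup"], "overall"):
            num[key] += w * row["reward"]
            den[key] += w
    return {k: (num[k] / den[k] if den[k] > 0 else float("nan")) for k in num}

def make_row(subgroup, tool, reward):
    # Toy logging policy: tool chosen uniformly (0.5), style always "plain".
    return {"subgroup": subgroup, "context": None, "tool": tool, "style": "plain",
            "p_tool": 0.5, "p_style": 1.0, "reward": reward}

# Invented toy logs, not the paper's data.
logs = [
    make_row("other", "heavy", 1.0),
    make_row("other", "light", 0.4),
    make_row("low_literacy_high_efficacy", "heavy", 0.3),
    make_row("low_literacy_high_efficacy", "light", 0.6),
]

def always_plain(ctx, style):
    return 1.0 if style == "plain" else 0.0

def always_heavy(ctx, tool):
    return 1.0 if tool == "heavy" else 0.0

def always_light(ctx, tool):
    return 1.0 if tool == "light" else 0.0

heavy = snips_by_subgroup(logs, always_heavy, always_plain)
light = snips_by_subgroup(logs, always_light, always_plain)
```

On these invented numbers the uniform heavy-tool policy looks better overall (about 0.65 versus 0.50) yet worse for the `low_literacy_high_efficacy` subgroup (0.30 versus 0.60), which is the kind of reversal a single average would hide.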


Taxonomy

Topics: Innovative Human-Technology Interaction · Data Visualization and Analytics · Electronic Health Records Systems