Synthesis and Evaluation of Long-term History-aware Medical Dialogue
Hebin Hu, Renke Dai, Ah-Hwee Tan, Yilin Kang

TL;DR
This paper presents a framework for synthesizing and evaluating long-term, history-aware medical dialogues using LLMs, addressing the lack of realistic datasets for healthcare agent development.
Contribution
It introduces MediLongChat, a high-quality synthetic dataset with benchmark tasks and a comprehensive evaluation framework for healthcare dialogue memory capabilities.
Findings
State-of-the-art LLMs struggle on the MediLongChat benchmark tasks.
The dataset enables systematic evaluation of long-term medical dialogue reasoning.
Multi-dimensional metrics effectively assess data quality and model performance.
Abstract
An effective healthcare agent must be able to recall and reason over a patient's longitudinal medical history. However, the absence of datasets with realistic long-term dialogue timelines limits systematic evaluation. Real clinical text is constrained by privacy and ethics, while existing benchmarks focus on isolated interactions, failing to capture cross-session reasoning. We introduce a framework for synthesizing high-quality, long-term medical dialogues with LLMs. Our approach entails a knowledge-guided decomposition into three stages: constructing synthetic patient profiles with diverse disease and complication trajectories, generating multi-turn dialogues per encounter, and integrating them into a coherent longitudinal history dataset, MediLongChat. We establish three benchmark tasks (In-dialogue Reasoning, Cross-dialogue Reasoning, and Synthesis Reasoning) to evaluate the memory…
