ThReadMed-QA: A Multi-Turn Medical Dialogue Benchmark from Real Patient Questions
Monica Munnangi, Saiph Savage

TL;DR
This paper introduces ThReadMed-QA, a benchmark of real multi-turn patient-physician dialogues drawn from Reddit's r/AskDocs, and shows that current language models struggle to stay accurate across multiple turns.
Contribution
It provides a novel, authentic multi-turn medical dialogue dataset and evaluates state-of-the-art LLMs, highlighting their limitations in multi-turn medical question-answering.
Findings
Models' accuracy drops significantly after initial turns.
Higher initial performance correlates with a steeper decline in accuracy over later turns.
Nearly one-third of conversations show fluctuating correctness within the same thread.
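To make the findings above concrete, here is a minimal sketch of how per-turn accuracy and the share of fluctuating threads could be computed from judged answers. It is not the paper's evaluation code. The input layout (one list of booleans per thread, one entry per turn), the function names, and the working definition of "fluctuating" (correctness changes at least once between consecutive turns) are all assumptions made for illustration, and the toy labels are invented.

```python
from collections import defaultdict


def per_turn_accuracy(threads):
    """Return {turn_number: accuracy} over all threads that reach that turn.

    `threads` is a hypothetical layout: a list of threads, each a list of
    booleans where True means the model's answer at that turn was judged correct.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for labels in threads:
        for turn, ok in enumerate(labels, start=1):
            totals[turn] += 1
            correct[turn] += int(ok)
    return {turn: correct[turn] / totals[turn] for turn in sorted(totals)}


def is_fluctuating(labels):
    """Assumed definition: correctness flips at least once between adjacent turns."""
    return any(a != b for a, b in zip(labels, labels[1:]))


def fluctuating_share(threads):
    """Fraction of threads whose correctness changes within the thread."""
    if not threads:
        return 0.0
    return sum(is_fluctuating(labels) for labels in threads) / len(threads)


# Invented toy labels, not data from the benchmark.
toy_threads = [
    [True, True, False],
    [True, False, True, True],
    [True, True],
    [False, False, False],
]

accuracy_by_turn = per_turn_accuracy(toy_threads)
share = fluctuating_share(toy_threads)
print(accuracy_by_turn)
print(share)
```

Because later turns are reached by fewer threads, each per-turn accuracy here is averaged only over the threads that actually contain that turn. A real analysis would also want to report the thread count behind each turn.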
Abstract
Medical question-answering benchmarks predominantly evaluate single-turn exchanges, failing to capture the iterative, clarification-seeking nature of real patient consultations. We introduce ThReadMed-QA, a benchmark of 2,437 fully-answered patient-physician conversation threads extracted from r/AskDocs, comprising 8,204 question-answer pairs across up to 9 turns. Unlike prior work relying on simulated dialogues, adversarial prompts, or exam-style questions, ThReadMed-QA captures authentic patient follow-up questions and verified physician responses, reflecting how patients naturally seek medical information online. We evaluate five state-of-the-art LLMs -- GPT-5, GPT-4o, Claude Haiku, Gemini 2.5 Flash, and Llama 3.3 70B -- on a stratified test split of 238 conversations (948 QA pairs) using a calibrated LLM-as-a-judge rubric grounded in physician ground truth. Even the strongest model,…
Taxonomy
Topics: Topic Modeling · Machine Learning in Healthcare · Neurobiology of Language and Bilingualism
