ThReadMed-QA: A Multi-Turn Medical Dialogue Benchmark from Real Patient Questions

Monica Munnangi; Saiph Savage

arXiv:2603.11281 · cs.CL · March 13, 2026

Open Access

TL;DR

This paper introduces ThReadMed-QA, a benchmark of real multi-turn patient-physician conversations from Reddit's r/AskDocs, and shows that current language models struggle to stay accurate across turns.

Contribution

The paper contributes an authentic multi-turn medical dialogue dataset and evaluates five state-of-the-art LLMs on it, highlighting their limitations in multi-turn medical question answering.

Findings

01. Models' accuracy drops significantly after initial turns.

02. High initial performance correlates with steeper decline in accuracy.

03. Nearly one-third of conversations show fluctuating correctness within the same thread.
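
To make the first and third findings concrete, here is a minimal sketch of how per-turn accuracy and "fluctuating correctness" could be computed from per-turn judge verdicts. It is not the authors' evaluation code. The input format (one list of booleans per thread), the function names, the toy data, and the definition of "fluctuating" as at least two correct/incorrect switches within a thread are all assumptions made for illustration.

```python
from collections import defaultdict

def per_turn_accuracy(threads):
    """Fraction of correct answers at each turn index (1-based).

    `threads` is a list of threads; each thread is a list of booleans,
    one per turn, where True means the answer was judged correct.
    (Hypothetical format, not the benchmark's actual schema.)
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for verdicts in threads:
        for turn, ok in enumerate(verdicts, start=1):
            totals[turn] += 1
            hits[turn] += int(ok)
    return {turn: hits[turn] / totals[turn] for turn in sorted(totals)}

def count_flips(verdicts):
    """Number of times correctness changes between consecutive turns."""
    return sum(1 for a, b in zip(verdicts, verdicts[1:]) if a != b)

def fluctuating_share(threads, min_flips=2):
    """Share of threads whose correctness switches at least `min_flips` times.

    The threshold is an assumption; the paper may define fluctuation differently.
    """
    if not threads:
        return 0.0
    flagged = sum(1 for v in threads if count_flips(v) >= min_flips)
    return flagged / len(threads)

# Toy verdicts, invented for illustration only (not results from the paper).
toy_threads = [
    [True, True, False],
    [True, False, True],
    [True, False],
    [False, False, False, True],
]

if __name__ == "__main__":
    print(per_turn_accuracy(toy_threads))
    print(fluctuating_share(toy_threads))
```

Note that later turns are averaged over fewer threads, since threads differ in length, so accuracy at high turn indices rests on a smaller sample.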

Abstract

Medical question-answering benchmarks predominantly evaluate single-turn exchanges, failing to capture the iterative, clarification-seeking nature of real patient consultations. We introduce ThReadMed-QA, a benchmark of 2,437 fully-answered patient-physician conversation threads extracted from r/AskDocs, comprising 8,204 question-answer pairs across up to 9 turns. Unlike prior work relying on simulated dialogues, adversarial prompts, or exam-style questions, ThReadMed-QA captures authentic patient follow-up questions and verified physician responses, reflecting how patients naturally seek medical information online. We evaluate five state-of-the-art LLMs -- GPT-5, GPT-4o, Claude Haiku, Gemini 2.5 Flash, and Llama 3.3 70B -- on a stratified test split of 238 conversations (948 QA pairs) using a calibrated LLM-as-a-judge rubric grounded in physician ground truth. Even the strongest model,…

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Neurobiology of Language and Bilingualism