Automatic Replication of LLM Mistakes in Medical Conversations
Oleksii Proniakin, Diego Fajardo, Ruslan Nazarenko, Razvan Marinescu

TL;DR
This paper introduces MedMistake, an automated pipeline that extracts and benchmarks specific mistakes made by large language models in medical conversations, facilitating targeted evaluation and improvement.
Contribution
MedMistake automatically generates a dataset of medical conversation mistakes and creates a benchmark for evaluating LLM performance on these errors.
Findings
GPT models, Claude, and Grok perform best on the benchmark.
The released dataset, MedMistake-All, contains 3,390 single-shot QA pairs that GPT-5 and Gemini 2.5 Pro fail to answer correctly, as judged by two LLM judges.
Medical experts validated a subset of questions for final evaluation.
Abstract
Large language models (LLMs) are increasingly evaluated in clinical settings using multi-dimensional rubrics that quantify reasoning quality, safety, and patient-centeredness. Yet replicating specific mistakes in other LLMs is not straightforward and often requires manual effort. We introduce MedMistake, an automatic pipeline that extracts mistakes LLMs make in patient-doctor conversations and converts them into a benchmark of single-shot QA pairs. Our pipeline (1) creates complex, conversational data between an LLM patient and an LLM doctor, (2) runs an evaluation with a committee of two LLM judges across a variety of dimensions, and (3) creates simplified single-shot QA scenarios from those mistakes. We release MedMistake-All, a dataset of 3,390 single-shot QA pairs where GPT-5 and Gemini 2.5 Pro are currently failing to answer correctly, as judged by two LLM judges. We used medical…
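The final filtering step described in the abstract (keeping only QA pairs that a model fails, as judged by two LLM judges) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `QAPair`, `committee_flags_failure`, and `build_benchmark` are hypothetical, the judges are simple string-matching stand-ins for LLM calls, and the unanimous-agreement rule is an assumption, since the abstract does not say how the two judges' verdicts are combined.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:
    """Hypothetical container for one single-shot question and its reference answer."""
    question: str
    reference: str


# A judge returns True when it considers the model's answer incorrect.
Judge = Callable[[QAPair, str], bool]


def strict_judge(qa: QAPair, answer: str) -> bool:
    # Stand-in for an LLM judge: case-sensitive match on the reference answer.
    return qa.reference not in answer


def lenient_judge(qa: QAPair, answer: str) -> bool:
    # Stand-in for a second LLM judge: case-insensitive match.
    return qa.reference.lower() not in answer.lower()


def committee_flags_failure(judges: List[Judge], qa: QAPair, answer: str) -> bool:
    # Assumption: a failure counts only if every judge agrees the answer is wrong.
    return all(judge(qa, answer) for judge in judges)


def build_benchmark(
    candidates: List[QAPair],
    model: Callable[[str], str],
    judges: List[Judge],
) -> List[QAPair]:
    """Keep only the QA pairs that the model under test fails according to the committee."""
    kept = []
    for qa in candidates:
        answer = model(qa.question)
        if committee_flags_failure(judges, qa, answer):
            kept.append(qa)
    return kept
```

In a real setting, `model` and each judge would wrap calls to an LLM, and the combination rule could be swapped for a different agreement criterion by changing `committee_flags_failure` alone.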
