# Quality Assessment of Large Language Model–Generated Medical Dialogue for Clinical Vignettes: Evaluation Study

**Authors:** Yasutaka Yanagita, Daiki Yokokawa, Shiichi Ihara, Ryo Yoshida, Yoshihide Okano, Takanori Uehara

PMC · DOI: 10.2196/80752 · JMIR Formative Research · 2025-11-03

## TL;DR

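This study evaluates the quality of AI-generated Japanese-language physician-patient dialogues. Three internists rated them highly overall (mean composite score 5.7 on a 7-point scale), suggesting that, with physician oversight, such dialogues could make it more efficient to produce training material for clinical interviewing.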

## Contribution

The study introduces a method to generate and assess AI-created physician-patient dialogues for medical education.

## Key findings

- AI-generated dialogues had a mean composite quality score of 5.7 (SD 1.0) on a 7-point Likert scale.
- Chief concern and clinical course appeared in all 47 dialogues and a diagnosis in 45 (96%), but treatment course appeared in none.
- Physician responses scored lower on medical accuracy (mean 5.6) than patient responses (mean 6.0).

## Abstract

Traditional clinical vignettes, though widely used in medical education, often focus on prototypical presentations; require substantial time and effort to develop; and fail to represent patient diversity, the complexity of clinical conditions, patients’ perspectives, and the dynamic nature of physician-patient interactions.

This study aimed to evaluate the quality of Japanese-language physician-patient dialogues produced by generative artificial intelligence (AI), focusing on their medical accuracy and overall appropriateness as medical interviews.

We created an AI prompt that included a specific clinical history and instructed the model to simulate a cooperative patient responding to the physician’s questions to generate a physician-patient dialogue. The target diseases were those covered by the Japanese National Medical Licensing Examination. Each dialogue consisted of 25 turns by the physician and 25 by the patient, reflecting the typical volume of conversation in Japanese outpatient settings. Three internists independently evaluated each generated dialogue using a 7-point Likert scale across 6 criteria: coherence of the conversation, medical accuracy of the patient’s responses, medical accuracy of the physician’s responses, content of the medical history, communication skills, and professionalism. In addition, a composite score for each dialogue was calculated as the overall mean of these 6 criteria. Each dialogue was also examined for the presence of 5 essential clinical components commonly included in medical interviews: chief concern and clinical course since onset, physical findings, test results, diagnosis, and treatment course. A dialogue was considered to include a component only if all 3 evaluators independently confirmed its presence.
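The two scoring rules described above (a composite score averaged over the 6 criteria, and a component counted as present only when all 3 evaluators agree) can be illustrated with a minimal Python sketch. The function names, dictionary keys, and example ratings are hypothetical and do not come from the paper. The abstract does not say how scores were pooled across evaluators, so the sketch assumes a plain mean over every evaluator-criterion score.

```python
from statistics import mean

# Hypothetical short keys for the 6 rating criteria named in the abstract.
CRITERIA = (
    "coherence",
    "patient_accuracy",
    "physician_accuracy",
    "history_content",
    "communication",
    "professionalism",
)

# Hypothetical short keys for the 5 essential clinical components.
COMPONENTS = (
    "chief_concern_and_course",
    "physical_findings",
    "test_results",
    "diagnosis",
    "treatment_course",
)


def composite_score(ratings):
    """Mean of all criterion scores for one dialogue.

    ratings: one dict per evaluator, mapping each criterion to a 1-7 score.
    Assumption: scores are pooled across evaluators with equal weight.
    """
    scores = []
    for evaluator in ratings:
        for criterion in CRITERIA:
            score = evaluator[criterion]
            if not 1 <= score <= 7:
                raise ValueError(f"{criterion} score {score} is outside the 1-7 scale")
            scores.append(score)
    return mean(scores)


def components_present(judgments):
    """A component counts as present only if every evaluator confirmed it.

    judgments: one dict per evaluator, mapping each component to True/False.
    """
    return {c: all(j[c] for j in judgments) for c in COMPONENTS}


# Made-up data for a single dialogue, for illustration only.
example_ratings = [
    dict.fromkeys(CRITERIA, 6),
    dict.fromkeys(CRITERIA, 5),
    dict.fromkeys(CRITERIA, 7),
]
example_judgments = [
    {"chief_concern_and_course": True, "physical_findings": True,
     "test_results": True, "diagnosis": True, "treatment_course": False},
    {"chief_concern_and_course": True, "physical_findings": False,
     "test_results": True, "diagnosis": True, "treatment_course": False},
    {"chief_concern_and_course": True, "physical_findings": True,
     "test_results": True, "diagnosis": True, "treatment_course": False},
]

print(composite_score(example_ratings))
print(components_present(example_judgments))
```

In this made-up example, physical findings would not be counted as present because one of the three evaluators did not confirm them, which shows how strict the unanimity rule is.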

The mean composite score was 5.7 (SD 1.0), indicating high overall quality. Mean scores for each criterion were as follows: coherence of the conversation, 5.9 (SD 0.9); medical accuracy of the patient’s responses, 6.0 (SD 0.9); medical accuracy of the physician’s responses, 5.6 (SD 1.1); content of medical history taking, 5.9 (SD 0.9); communication skills, 5.6 (SD 0.9); and professionalism, 5.5 (SD 1.1). Among the 5 clinical components assessed in each dialogue across 47 clinical cases, chief concern and clinical course were included in all 47 (100%) cases, physical findings in 15 (32%) cases, test results in 27 (57%) cases, diagnosis in 45 (96%) cases, and treatment course in 0 (0%) cases.
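As a quick arithmetic check, the reported percentages follow directly from the component counts over the 47 cases. The short Python sketch below recomputes them; the dictionary keys and the helper name are hypothetical labels, while the counts are the ones given above.

```python
TOTAL_CASES = 47

# Counts of dialogues containing each component, as reported in the abstract.
COMPONENT_COUNTS = {
    "chief_concern_and_course": 47,
    "physical_findings": 15,
    "test_results": 27,
    "diagnosis": 45,
    "treatment_course": 0,
}


def as_percent(count, total=TOTAL_CASES):
    """Share of cases as a whole-number percentage."""
    return round(100 * count / total)


percentages = {name: as_percent(n) for name, n in COMPONENT_COUNTS.items()}

for name, pct in percentages.items():
    print(f"{name}: {COMPONENT_COUNTS[name]}/{TOTAL_CASES} ({pct}%)")
```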

While physician oversight remains essential, it is feasible to efficiently create AI-generated educational materials for medical education that overcome the limitations of traditional clinical vignettes. This approach may reduce time and financial burdens, enhancing opportunities to practice clinical interviewing in settings that closely mirror real-world encounters.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12624296/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12624296/full.md

## References

26 references — full list in the complete paper: https://tomesphere.com/paper/PMC12624296/full.md

---
Source: https://tomesphere.com/paper/PMC12624296