# AI- vs Human-Based Assessment of Medical Interview Transcripts in a Generative AI–Simulated Patient System: Cross-Sectional Validation Study

**Authors:** Hiromizu Takahashi, Kiyoshi Shikino, Takeshi Kondo, Yuji Yamada, Yoshitaka Tomoda, Minoru Kishi, Yuki Aiyama, Sho Nagai, Akiko Enomoto, Yoshinori Tokushima, Takahiro Shinohara, Fumiaki Sano, Takeshi Matsuura, Rikiya Watanabe, Toshio Naito

PMC · DOI: 10.2196/81673 · JMIR Medical Education · 2026-02-17

## TL;DR

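This cross-sectional study compares AI-based scoring (GPT-o1 Pro and GPT-5 Pro) with clinical instructors' scoring of medical interview transcripts from an AI-simulated patient. AI scores agreed closely with instructor scores (r=0.90), were more repeatable across runs, and took less than half the time to produce.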

## Contribution

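Using one standardized case and 7 interview transcripts, the study shows that AI-based assessment can closely track instructor scores on the 25-item Master Interview Rating Scale while cutting evaluation time by 58% to 67.6% and roughly halving score variability (coefficient of variation 6.6% vs 13.9%).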

## Key findings

- AI-based assessments showed strong agreement with human assessments (r=0.90, concordance correlation coefficient=0.88).
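- AI assessments were more consistent than human assessments: the ICC(3,1) was 0.77 (GPT-o1 Pro) and 0.82 (GPT-5 Pro) vs an ICC(2,1) of 0.38 for instructors, and the coefficient of variation was 6.6% vs 13.9%.
- AI reduced evaluation time by 58% (GPT-o1 Pro) and 67.6% (GPT-5 Pro) compared with human instructors.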

## Abstract

Generative artificial intelligence (AI) is increasingly used in medical education, including AI-based virtual patients to improve interview skills. However, how much AI-based assessment (ABA) differs from human-based assessment (HBA) remains unclear.

This study aimed to compare the quality of clinical interview assessments generated via an ABA (GPT-o1 Pro [ABA-o1] and GPT-5 Pro [ABA-5]) with those generated via an HBA conducted by clinical instructors in an AI-based virtual patient setting. We also examined whether AI reduced evaluation time and assessed agreement across participants with different levels of clinical experience.

A standardized case of leg weakness was implemented in an AI-based virtual patient. Seven participants (2 medical students, 3 residents, and 2 attending physicians) each conducted an interview with the AI patient, and transcripts were scored using the 25-item Master Interview Rating Scale (0‐125). Three evaluation strategies were compared. First, GPT-o1 Pro and GPT-5 Pro scored each transcript 5 times with different random seeds to test case specificity. Processing time was logged automatically. Second, 5 blinded clinical instructors independently rated each transcript once using the same rubric. Third, reliability metrics were applied. For AI, intraclass correlation coefficients (ICCs) quantified repeatability. For humans, the ICC(2,1) was calculated. Agreement was quantified using the Pearson r, Lin concordance correlation coefficient, Bland-Altman limits of agreement, Cronbach α, and ICC. Time efficiency was expressed as mean minutes per transcript and relative percentage reduction.
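As a rough illustration of the agreement and reliability statistics named above, the following Python sketch computes the Pearson r, Lin concordance correlation coefficient, Bland-Altman limits of agreement, and single-rater ICCs from a two-way ANOVA decomposition. The function names, the 1.96 multiplier, and the use of population (1/n) moments in Lin's coefficient are assumptions made for this example; the abstract does not state which software or estimator variants the authors used, and the ratings table is illustrative, not study data.

```python
from statistics import mean, pvariance, stdev


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def lin_ccc(x, y):
    """Lin's CCC with population (1/n) moments (an assumed estimator variant).

    Unlike Pearson r, it drops below 1 when the two methods differ in mean or scale.
    """
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return 2 * cov / (pvariance(x) + pvariance(y) + (mx - my) ** 2)


def bland_altman(x, y, z=1.96):
    """Return (bias, lower limit, upper limit) for the paired differences x - y."""
    diffs = [a - b for a, b in zip(x, y)]
    bias, sd = mean(diffs), stdev(diffs)
    return bias, bias - z * sd, bias + z * sd


def icc_single(scores):
    """Return (ICC(2,1), ICC(3,1)) for a subjects x raters table of scores."""
    n, k = len(scores), len(scores[0])
    grand = mean([v for row in scores for v in row])
    row_means = [mean(row) for row in scores]
    col_means = [mean([row[j] for row in scores]) for j in range(k)]
    # Two-way ANOVA sums of squares: subjects (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    # ICC(2,1): raters treated as random, absolute agreement.
    icc21 = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    # ICC(3,1): raters treated as fixed, consistency only.
    icc31 = (msr - mse) / (msr + (k - 1) * mse)
    return icc21, icc31


# Illustrative ratings only (not study data): 6 transcripts x 4 raters.
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc21, icc31 = icc_single(ratings)
print(f"ICC(2,1)={icc21:.2f}  ICC(3,1)={icc31:.2f}")
```

In this toy table the raters differ systematically in how strictly they score, so ICC(2,1) comes out well below ICC(3,1). The same effect is worth keeping in mind when reading an ICC(2,1) for instructors alongside an ICC(3,1) for repeated AI runs, because the two forms answer slightly different questions.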

Mean interview scores were similar across methods (ABA-o1: mean 52.1, SD 6.9; ABA-5: mean 53.2, SD 6.8; HBA: mean 53.7, SD 6.8). Agreement between ABA and HBA was strong (r=0.90; concordance correlation coefficient=0.88) with minimal bias (ABA-o1: mean 0.4, SD 2.7; ABA-5: mean 1.5, SD 5.2; limits of agreement: –4.9 to 5.7 for ABA-o1 and –8.6 to 11.7 for ABA-5). The Cronbach α was 0.81 (ABA-o1), 0.86 (ABA-5), and 0.80 (HBA); the ICC(3,1) was 0.77 (ABA-o1) and 0.82 (ABA-5); and the ICC(2,1) was 0.38 (HBA). The coefficient of variation for ABA was approximately half that of HBA (6.6% vs 13.9%). Processing time for 5 runs was 4 minutes, 19 seconds for ABA-o1 and 3 minutes, 20 seconds for ABA-5 vs 10 minutes, 16 seconds for physicians, corresponding to 58% and 67.6% reductions, respectively.
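The reported limits of agreement and time savings can be approximately reproduced from the rounded summary values above. The sketch below assumes the conventional Bland-Altman form (bias ± 1.96 × SD of the differences) and treats the physician time of 10 minutes, 16 seconds as the reference. The helper names are hypothetical, and because the inputs are already rounded, results can differ from the published figures in the last digit.

```python
def limits_of_agreement(bias, sd, z=1.96):
    """Bland-Altman limits, assuming the conventional bias +/- 1.96 * SD form."""
    return bias - z * sd, bias + z * sd


def to_seconds(minutes, seconds):
    """Convert a minutes/seconds duration to seconds."""
    return 60 * minutes + seconds


def percent_reduction(new, reference):
    """Relative time saving of `new` versus `reference`, in percent."""
    return 100 * (1 - new / reference)


# Bias and SD of differences as reported in the abstract.
loa_o1 = limits_of_agreement(0.4, 2.7)
loa_5 = limits_of_agreement(1.5, 5.2)

# Processing times as reported in the abstract.
hba_time = to_seconds(10, 16)
reduction_o1 = percent_reduction(to_seconds(4, 19), hba_time)
reduction_5 = percent_reduction(to_seconds(3, 20), hba_time)

# Ratio of the reported coefficients of variation (ABA vs HBA).
cv_ratio = 6.6 / 13.9

print("LoA ABA-o1:", [round(v, 1) for v in loa_o1])
print("LoA ABA-5:", [round(v, 1) for v in loa_5])
print("Time reduction (%):", round(reduction_o1, 1), round(reduction_5, 1))
print("CV ratio:", round(cv_ratio, 2))
```

Recomputing from rounded inputs gives roughly –4.9 to 5.7 for ABA-o1 and time reductions near 58% and 67.5%, in line with the abstract. The CV ratio of about 0.47 matches the "approximately half" description.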

ABA-o1 and ABA-5 produced scores closely matching HBA while demonstrating superior consistency and reliability. In the setting of virtual interview transcripts, these findings suggest that ABA may serve as a valid, rapid, and scalable alternative to HBA, reducing per-assessment time by over half. Applied strategically, AI-based scoring could enable timely feedback, improve efficiency, and reduce faculty workload. Further research is needed to confirm generalizability across broader settings.

## Full-text entities

- **Diseases:** diarrhea (MESH:D003967), hyperthyroidism (MESH:D006980), insomnia (MESH:D007319), leg weakness (MESH:D018908), hypokalemia (MESH:D007008), myalgias (MESH:D063806), thyrotoxic periodic paralysis (OMIM:188580), tremors (MESH:D014202)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12912650/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12912650/full.md

## References

19 references — full list in the complete paper: https://tomesphere.com/paper/PMC12912650/full.md

---
Source: https://tomesphere.com/paper/PMC12912650