Authors’ Reply: Critical Limitations in Comparing ChatGPT and DeepSeek for Orthopedic Assessment

Chirathit Anusitviwat; Sitthiphong Suwannaphisit; Jongdee Bvonpanttarananon; Boonsin Tangtrakulwanich

PMC · DOI: 10.2196/91470 · March 17, 2026

Open Access

Keywords

ChatGPT; large language model; LLM; orthopedic; multiple-choice question; MCQ

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills · Radiomics and Machine Learning in Medical Imaging

Full text

We thank the authors for their useful and constructive comments [1] on our article “Comparing ChatGPT and DeepSeek for Assessment of Multiple-Choice Questions in Orthopedic Medical Education: Cross-Sectional Study” [2]. This reply addresses the concerns raised in their letter to the editor.

Misinterpretation of Reliability Statistics

In our study, we administered the multiple-choice questions (MCQs) to ChatGPT and DeepSeek on separate days. All responses from the two large language models (LLMs) were rated by two assessors. Because the two assessors rated each LLM's responses separately, the reported Cohen κ coefficient values represent within-model interrater reliability, not agreement between the two LLMs [3]. Therefore, describing these results as agreement between the two models is inaccurate.
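To make the distinction concrete, the following is a minimal sketch of how Cohen κ is computed for two assessors rating the same model's responses. The function name `cohen_kappa` and the ratings shown are hypothetical illustrations, not the study's data or analysis code; 1 stands for a response judged correct and 0 for one judged incorrect.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who scored the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)

    # Observed agreement: share of items on which both raters gave the same label.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal proportions, summed over labels.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    # Kappa is undefined when chance agreement is already perfect.
    if p_expected == 1.0:
        return float("nan")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical ratings of ONE model's responses by two assessors.
assessor_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
assessor_2 = [1, 1, 0, 1, 1, 1, 1, 0, 0, 1]

kappa_within_model = cohen_kappa(assessor_1, assessor_2)
print(round(kappa_within_model, 3))
```

Both input lists describe the same model's answers, so the resulting κ says how consistently the two assessors scored that model. It carries no information about whether ChatGPT and DeepSeek agree with each other.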

Linguistic Ambiguity and Generalizability

All MCQs used in our study were administered in English. No Thai-language inputs or translations were used. Therefore, the performance differences between the two models reflect model performance on English-language medical questions rather than variability due to translation or non-English linguistic processing.

Reproducibility and Interface Transparency

All models in our study were accessed via web-based user interfaces (UIs), not application programming interfaces. We acknowledge that web-based UIs may be subject to updates and lack version control. However, the web-based version of ChatGPT is easy to access and requires no software installation. It also allows quick testing and exploration without technical or cost barriers, making it well-suited for nontechnical users and educational studies [4]. Therefore, we used the web-based UI in our study.

Risk of Data Contamination

Although the MCQs used in our study have been in use for more than 5 years, they come from private orthopedic examinations. Thus, we believe that these items would not appear in public sources. Future research using newly created MCQs may be better suited to assessing the capability or efficacy of LLMs.

Data Reporting Discrepancy

Upon re-examination, we confirm that the correct accuracy for the pelvic and spine injury category (n=19) using the Reason function is indeed 16 of 19, corresponding to approximately 84.2%. The value of 68.8% reported in Table 2 was a typographical error. This error has been corrected through a published corrigendum [5].
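The corrected figure can be checked with a few lines of arithmetic. The sketch below is an illustration only; the helper names `accuracy_percent` and `counts_matching` are hypothetical and are not part of the study's analysis.

```python
def accuracy_percent(correct, total):
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)


def counts_matching(percent, total):
    """All whole-number correct counts out of `total` that round to `percent`."""
    return [
        k for k in range(total + 1)
        if abs(accuracy_percent(k, total) - percent) < 1e-9
    ]


corrected = accuracy_percent(16, 19)          # 16 of 19 correct
for_corrected = counts_matching(84.2, 19)     # counts consistent with 84.2%
for_original = counts_matching(68.8, 19)      # counts consistent with 68.8%
print(corrected, for_corrected, for_original)
```

Under these assumptions, 16 of 19 rounds to 84.2%, and no whole number of correct answers out of 19 rounds to 68.8%, which is consistent with the earlier value being a typographical error.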

Bibliography

1. Ayas O, Acar A. Critical limitations in comparing ChatGPT and DeepSeek for orthopedic assessment. JMIR Form Res. 2026;10:e90242. doi: 10.2196/90242. Medline: 41843773. PMCID: PMC12994765.
2. Anusitviwat C, Suwannaphisit S, Bvonpanttarananon J, Tangtrakulwanich B. Comparing ChatGPT and DeepSeek for assessment of multiple-choice questions in orthopedic medical education: cross-sectional study. JMIR Form Res. Dec 19, 2025;9:e75607. doi: 10.2196/75607. Medline: 41418321. PMCID: PMC12716854.
3. McHugh ML. Interrater reliability: the kappa statistic. Biochem Med (Zagreb). 2012;22(3):276-282. Medline: 23092060. PMCID: PMC3900052.
4. Park CR, Heo H, Suh CH, Shim WH. Uncover this tech term: application programming interface for large language models. Korean J Radiol. Aug 2025;26(8):793-796. doi: 10.3348/kjr.2025.0360. Medline: 40736411. PMCID: PMC12318651.
5. Anusitviwat C, Suwannaphisit S, Bvonpanttarananon J, Tangtrakulwanich B. Correction: comparing ChatGPT and DeepSeek for assessment of multiple-choice questions in orthopedic medical education: cross-sectional study. JMIR Form Res. Feb 26, 2026;10:e92549. doi: 10.2196/92549. Medline: 41747218. PMCID: PMC12945383.