Evaluating AI Versus Examiner Feedback in Ophthalmology Exit Examinations: A Pilot Study
Binita Panchasara, Andrew Malem, Paul Calcraft, Sean Zhou

TL;DR
This pilot study compares feedback from AI and human examiners in ophthalmology exams, finding that AI offers structured guidance while humans provide practical insights.
Contribution
The study introduces an AI-based platform for simulating ophthalmology exams and delivering structured feedback.
Findings
AI feedback was more structured and referenced specific frameworks.
Human feedback was practical and context-specific with experiential insights.
Both feedback types covered similar themes like empathy and communication clarity.
Abstract
Objectives This pilot study aims to compare the quality of feedback provided by generative artificial intelligence (AI) and an official Royal College Examiner for simulated clinical and communication scenarios designed to prepare candidates for the Royal College of Ophthalmologists Part 2 Oral Examination. Design Utilising GPT 3.5 and 4 (OpenAI, San Francisco, CA, USA), an interactive web-based platform has been created that is able to simulate both patient and examiner roles in oral examination scenarios and simultaneously provide feedback on a candidate’s performance. Feedback was provided solely using GPT-4 in combination with prompt techniques. A standardised patient was used to enact five clinical and communication scenarios that were each assessed by both the AI and a Royal College of Ophthalmologists Examiner. The transcripts from these sessions were thematically analysed…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
| Similarities | Differences | |
| Communication | Both emphasise empathy and professionalism in patient interactions. They agree on the importance of clarity in communication, the need to involve patients, and comprehensive history-taking. Both value structured communication, though Examiner A explicitly recommends frameworks like SPIKES. They stress the need for clear follow-up planning. | Examiner A offers more detailed and structured feedback, often rooted in specific frameworks. Examiner B gives more practical, situation-specific advice, focusing on how to improve in real-world scenarios. For example, while both discuss breaking bad news, Examiner A suggests using structured protocols, whereas Examiner B advises on situational methods. |
| Clinical | Both emphasise a systematic approach in clinical reasoning and image interpretation. They stress the importance of detailed descriptions and effective categorization of clinical information. They highlight the necessity of sensitivity and clarity in communication with patients/parents. Both recommend further study to enhance clinical skills and acknowledge the role of guidelines and landmark trials. | Examiner A focuses on broader and more systematic categorizations, while Examiner B provides more specific and practical guidance. Examiner B is more precise in advising on details for image interpretation and suggests specific tests during clinical examinations. Examiner A often recommends further study, whereas Examiner B concentrates on refining specific knowledge areas. |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsArtificial Intelligence in Healthcare and Education · Autopsy Techniques and Outcomes · Anatomy and Medical Technology
Introduction
At the heart of healthcare is a need for continuous learning and training, which is constantly being adapted to meet the rapidly changing landscape of evidence-based practice and social requirements. Several large language models (LLMs) have been tested to different degrees in healthcare, with models such as GPT-4 (OpenAI, San Francisco, CA, USA) and PaLM2 (Google, Inc., Mountain View, CA, USA) showing real promise in the field of postgraduate education [1-5].
As a general-purpose model, GPT-4 has demonstrated advanced reasoning and content generation abilities that rival those of experts in fields such as ophthalmology, internal medicine and emergency medicine [6-8]. Its accessibility enables even non-experts to generate tailored learning content and simulations for exam preparation.
Post-graduate medical examinations are notoriously demanding, requiring substantial time, effort and financial investment from already overburdened doctors. Oral examinations such as OSCEs, VIVAs and PACES, which typically represent the final stage of the exam process, can be particularly challenging to prepare for. Candidates often need to find peers to practise with or enrol onto expensive courses to gain the necessary practice and build their confidence. While studies have explored the capability of LLMs like GPT-4 to answer post-graduate level exam questions, there is currently no data on the use of such systems to provide interactive practice and feedback of sufficient quality to prepare individuals for their oral examinations. Such a model would provide a flexible and easily accessible means by which candidates could practice their communication skills individually, in order to gain the knowledge and confidence necessary to tackle in-person sessions or indeed the exam itself. Such a system may additionally have the potential to help international medical graduates integrate more easily into the UK healthcare system.
Early small-scale studies have demonstrated the ability of GPT models to help clinicians develop their communication skills to good effect in challenging scenarios. One study demonstrated promising results when GPT was used to test radiographers' communication skills with claustrophobic patients during an MRI scan [9]. Others have shown the potential to evaluate specialist communication skills in the context of a high-stress encounter or when discussing patient interventions [10]. Despite these advancements, however, concerns persist over the use of LLMs in medicine. Chief among these are the risks of propagating misinformation and the unreliability of outputs, which are the result of both the training datasets and the probabilistic nature of LLMs. Prompt engineering is one approach to reducing such risks by guiding generalist models such as GPT-4 to generate responses that are more accurate, reliable and clinically appropriate [1].
This paper looks at a conversational AI platform designed to enhance communication and clinical reasoning skills, using the Royal College of Ophthalmologists Part 2 Oral Examination as a benchmark. The primary objective is to evaluate and compare the feedback provided by AI in interactive clinical and communication scenarios against the feedback given by an official Royal College of Ophthalmologists Examiner. Through this pilot study, we aim to assess the quality of the AI’s feedback and better understand the role that generative AI can have in oral exam preparation.
Materials and methods
A platform was built combining existing generative LLMs (GPT 3.5 and 4) with computer voice recognition and synthesis to simulate oral examination scenarios for the Part 2 Fellowship of the Royal College of Ophthalmologists (FRCOphth) examination. In each communication scenario, the AI simulates both the patient and examiner, providing personalised feedback to the candidate at the end of the interaction. The clinical scenarios involve the AI acting as the examiner alone, asking questions on a specific topic, while simultaneously analysing the candidate’s performance, which is similarly fed back to them at the end. The feedback component is modelled using GPT-4 in combination with consistent and carefully structured prompting, which utilised techniques including series, parallel, few-shot and generated knowledge prompting in addition to retrieval augmented generation (RAG).
A standardised patient was used to simulate five clinical and communication scenarios which were recorded and evaluated by a Royal College Examiner (Examiner B) as well as simultaneously by the AI. Examiner A was blinded to the feedback outcomes of Examiner B and vice versa.
The transcript was thematically categorised, and qualitative content analysis was carried out utilising NVivo software (Lumivero, Burlington, MA, USA). Analysis of the final defined themes was examined between the LLM-derived and examiner-derived feedback transcripts.
Results
To summarise the findings of our thematic analysis, both examiner A (AI-powered) and examiner B provided feedback on similar themes, but may have had different approaches to this. Below is a table (Table 1) outlining the comparisons of their performance on the communication and clinical scenarios.
Summary of results
While both examiners provide valuable feedback, Examiner A (AI) is more structured and often protocol-driven, offering detailed insights based on specific frameworks, while Examiner B (Human) focused more on the specific situation. No hallucinations were noted in the AI feedback responses. Please see attached Appendix 1 for the full results of qualitative content analysis.
Discussion
This pilot study aimed to evaluate the quality of feedback provided by a conversational AI model compared to that of a Royal College Examiner in the context of preparing candidates for the Royal College of Ophthalmologists Part 2 Oral Examination. Our findings highlight that both the AI and human examiner provided feedback on similar themes both in the communication and clinical scenarios, although their approaches differed. The AI provided more structured, framework-driven feedback whereas the human examiner offered more practical, context-specific advice. While the AI demonstrated potential for consistent and standardised feedback, the human examiner provided more nuanced insights tailored to real-world applications. This reflects the inherent differences in their respective interactions and cognitive pathways, with the AI’s feedback being driven by pattern recognition and reference to specific written frameworks, model answers and guidelines, while the human examiner’s insights are shaped by experiential knowledge, empathy and a deeper understanding of context and real-world complexities.
A key strength of this preliminary study is the direct comparison between AI-generated feedback and that provided by an official examiner, which is an important first step in understanding the potential role that AI can have in supporting doctors preparing for oral medical board examinations. With the rapid advancement of AI and LLMs, conversational AI is becoming an area of growing interest in medical education. The findings presented here contribute to the emerging body of evidence needed to critically assess and validate its use. While our results demonstrate GPT-4’s ability to produce relevant, detailed and structured feedback with careful prompting, it remains important to contextualise its performance alongside other AI models and platforms. A meaningful comparison is nonetheless difficult, as most existing studies focus on the use of LLMs to answer medical exam questions rather than their ability to generate feedback in simulated clinical interactions.
GPT-3, the predecessor of GPT-4, has been widely studied for its natural language generation capabilities; however, its outputs were often less accurate in domain-specific contexts due to limited reasoning depth and factual consistency. GPT-4 alternatively has been shown to be superior in both coherence and factual grounding, particularly when prompted effectively, as demonstrated by our results [7,11]. At the time of our study being conducted, GPT-4 was the most practical model to integrate into the test scenarios. While other LLMs such as Gemini/Bard and earlier versions of GPT were considered, GPT-4 provided the optimal balance between output quality and implementation feasibility.
Of particular interest is the prospect of using domain-specific models to improve the accuracy and consistency of AI-assisted medical education learning tools. Google’s Med-PaLM2 is one such model that demonstrated a high degree of accuracy, rivaling even that of experts, in answering USMLE-style questions [8]. More recently, the development of open-source models such as OpenBioLLM-8B and Palmyra-Med, have been preliminarily shown to out-perform existing leading models including Med-PaLM2 and GPT-4 in answering medical exam questions [7,8,12-15]. While these prospects are indeed exciting, there remains a lack of high-quality evidence to support their safe and effective use in healthcare, particularly in the context of interactive learning.
The study does have its limitations, the first of which is the small sample size and limited range of scenarios which restrict the generalisability of the findings. The study’s reliance on a single AI model and a single human examiner additionally limits the generalisability of findings, as it does not account for the variability in feedback styles and effectiveness across different models and examiners, and introduces a potential source of bias. Furthermore, the study did not assess the long-term impact of the AI's feedback on exam performance or clinical practice, leaving questions about its value in real-world settings.
One significant limitation to using general LLMs is their inherent variability and risk of hallucinations. To address this, we used a more powerful model (GPT-4) for feedback generation and applied consistent and carefully structured prompts, using a combination of techniques, across all scenarios. The prompts provided structured answer frameworks and guidelines, minimising reliance on the model’s general pretraining and reducing the extent to which clinical reasoning was required. While minor variation in phrasing was observed, the core content, thematic accuracy and feedback quality remained stable. Though this may be reasonable for the purpose of this small study, further research is necessary to quantify output variability more systematically for a wider context of clinical scenarios. In the meantime, users of AI-based learning tools should be aware of the potential for misinformation and corroborate content with trusted clinical sources.
A significant difference between the findings of this study and others is the depth of feedback provided by the AI model. While previous reports suggest that AI can provide broad feedback on performance, this study demonstrates that AI can deliver more objective, structured and systematic guidance based on established clinical frameworks [9,10]. Taking this into consideration, a hybrid model of learning may offer the greatest benefit whereby AI-based educational tools are used to complement traditional learning methods to address gaps in the AI's ability to provide “real-world” training and to help reduce the risk of promoting mechanical or overly formulaic responses should one become overly reliant on AI-generated content.
The use of an AI-based interactive educational platform for clinical exam preparation may be beneficial in helping doctors to build a foundational understanding of communication and clinical reasoning frameworks before engaging in traditional in-person simulation, which may be of particular benefit for international medical graduates or those unfamiliar with the specific expectations of Royal College examinations. For clinicians, the use of AI could mean more accessible and flexible learning options, with the potential to reduce the financial and time burdens associated with traditional preparation techniques. Furthermore, AI could help democratise access to high-quality educational resources, making them more available to a broader range of learners, including those in resource-poor settings.
Conclusions
Overall, this study provides an early insight into the potential role of generative AI in medical education, highlighting its strengths while acknowledging the need for further research and thoughtful integration. This study marks the beginning of a larger initiative to generate robust, quantitative evidence on the effectiveness of AI as a tool for oral examination training. Beyond this, there is a need to explore the long-term effects of AI-based learning on communication skills development, exam performance and ultimately, its impact on clinical practice.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Leveraging Chat GPT for ophthalmic education: a critical appraisal Eur J Ophthalmol Gurnani B Kaur K 3233273420243797442910.1177/11206721231215862 · doi ↗ · pubmed ↗
- 2Chat GPT versus human performance on emergency medicine board preparation questions Ann Emerg Med [Internet] Jarou ZJ Dakka A Mc Guire D Bunting L 8788832024 https://www.sciencedirect.com/science/article/pii/S 01960644230066373772501710.1016/j.annemergmed.2023.08.010 · doi ↗ · pubmed ↗
- 3Performance of Chat GPT on USMLE: potential for AI-assisted medical education using large language models PLOS Digit Health Kung TH Cheatham M Medenilla A 02202310.1371/journal.pdig.0000198 PMC 993123036812645 · doi ↗ · pubmed ↗
- 4Can Chat GPT pass the MRCP (UK) written examinations? Analysis of performance and errors using a clinical decision-reasoning framework BMJ Open [Internet] Maitland A Fowkes R Maitland S 0142024 http://bmjopen.bmj.com/content/14/3/e 080558.abstract 10.1136/bmjopen-2023-080558 PMC 1094634038490655 · doi ↗ · pubmed ↗
- 5Mo 1066 Large language model-based simulated patients with upper gastrointestinal bleeding for medical education - a pilot study with Empath GPT Gastroenterology Rajashekar N Chan C Laine L Shung D 09331662024 https://www.sciencedirect.com/science/article/abs/pii/S 0016508524026283
- 6Utilizing generative conversational artificial intelligence to create simulated patient encounters: a pilot study for anaesthesia training Postgrad Med J Sardesai N Russo P Martin J Sardesai A 23724110020243824005410.1093/postmj/qgad 137 · doi ↗ · pubmed ↗
- 7Large language models encode clinical knowledge Nature Singhal K Azizi S Tu T 17218062020233743853410.1038/s 41586-023-06291-2PMC 10396962 · doi ↗ · pubmed ↗
- 8Toward expert-level medical question answering with large language models Nat Med Singhal K Tu T Gottweis J 9439503120253977992610.1038/s 41591-024-03423-7PMC 11922739 · doi ↗ · pubmed ↗
