Accuracy of Artificial Intelligence Versus Clinicians in Real-Life Case Scenarios of Retinopathy of Prematurity
Akash Belenje, Dhanush Pandya, Subhadra Jalali, Padmaja K Rani

TL;DR
This study compared ChatGPT's accuracy to clinicians in retinopathy of prematurity case scenarios, finding AI performed better than trainees and aligned well with experts.
Contribution
The study demonstrates that ChatGPT can outperform less experienced clinicians in ROP case scenarios and aligns closely with expert responses.
Findings
ChatGPT answered 90% of ROP cases correctly, outperforming trainees (77.5%).
ROP experts were the most accurate (96%), and ChatGPT showed substantial agreement with the majority of clinician responses, with an average Cohen’s kappa of 0.80.
AI responses were consistent across ChatGPT versions 3.5 and 4.0 over time.
Abstract
Objective The aim of this study was to compare the accuracy of ChatGPT artificial intelligence (AI) with clinicians in real-life case scenarios related to retinopathy of prematurity (ROP). Methods This was a prospectively conducted study with a real-life case scenario-based questionnaire with multiple-response answers. Thirteen clinicians, including eight vitreoretinal fellowship trainees (with less than two years of experience in the management of ROP) and five ROP experts (with more than three years of experience in the management of ROP), were given 10 real-life case scenarios in ROP. The majority of responses from trainees and ROP experts were compared with the ChatGPT AI-generated responses. The ChatGPT exercise was repeated for both versions 3.5 and 4.0 more than a month apart on May 29, 2024, and July 18, 2024, to check for the majority of AI response consistency. For each…
| Case scenario | Response (May 29) | Response (July 18) | Cohen’s kappa |
| 1 | Screening must be done immediately | Screening must be done immediately | 1 |
| 2 | Screening must be done within 1 month | Screening must be done within 1 month | 1 |
| 3 | Treatment must be done immediately | Treatment must be done immediately | 1 |
| 4 | Observe and review in 2 weeks | Observe and review in 2 weeks | 1 |
| 5 | Observe and review in 1 week | Observe and review in 1 week | 1 |
| 6 | Intravitreal anti-VEGF | Intravitreal anti-VEGF | 1 |
| 7 | Immediate lens-sparing vitrectomy | Immediate lens-sparing vitrectomy | 1 |
| 8 | Posterior zone 2 definition incorrect | Posterior zone 2 definition incorrect | 1 |
| 9 | 1.25 mg in 0.05 mL | 1.25 mg in 0.05 mL | 1 |
| 10 | Treatment must be done immediately | Treatment must be done immediately | 1 |
| Question number | VR fellow response (8 VR fellows) | ROP expert response (5 ROP experts) | Majority of clinician responses (13 clinicians) | Majority of ChatGPT 3.5/4.0 responses (May 29, 2024, and July 18, 2024) | Cohen’s kappa |
| 1 | 6 correct, 2 wrong | 5 correct | 11 correct, 2 wrong; 84.6% agreement | 4 correct; 100% agreement | 0.82 |
| 2 | 5 correct, 3 wrong | 4 correct, 1 wrong | 9 correct, 4 wrong; 69.2% agreement | 4 correct; 100% agreement | 0.54 |
| 3 | 7 correct, 1 wrong | 5 correct | 12 correct, 1 wrong; 92.3% agreement | 4 correct; 100% agreement | 0.88 |
| 4 | 7 correct, 1 wrong | 5 correct | 12 correct, 1 wrong; 92.3% agreement | 4 correct; 100% agreement | 0.88 |
| 5 | 6 correct, 2 wrong | 5 correct | 11 correct, 2 wrong; 84.6% agreement | 4 correct; 100% agreement | 0.82 |
| 6 | 8 correct | 5 correct | 13 correct; 100% agreement | 4 correct; 100% agreement | 1.0 |
| 7 | 8 correct | 5 correct | 13 correct; 100% agreement | 4 correct; 100% agreement | 1.0 |
| 8 | 3 correct, 5 wrong | 4 correct, 1 wrong | 7 correct, 6 wrong; 53.8% agreement | 4 wrong; 0% agreement | 0.08 |
| 9 | 5 correct, 3 wrong | 5 correct | 10 correct, 3 wrong; 76.9% agreement | 4 correct; 100% agreement | 0.74 |
| 10 | 7 correct, 1 wrong | 5 correct | 12 correct, 1 wrong; 92.3% agreement | 4 correct; 100% agreement | 0.88 |
| Overall | 62 correct responses out of 80; 77.5% agreement | 48 correct responses out of 50; 96% agreement | 110 correct responses out of 130; 84.6% agreement | 36 correct responses out of 40; 90% agreement | 0.80 |
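
The overall percentages in the table follow directly from the per-question counts. The short Python sketch below re-derives them; the list names and the `percent_correct` helper are illustrative and not part of the study, and the counts are transcribed from the rows above.

```python
# Per-question "correct" counts transcribed from the table above.
fellows_correct = [6, 5, 7, 7, 6, 8, 8, 3, 5, 7]   # out of 8 fellows per question
experts_correct = [5, 4, 5, 5, 5, 5, 5, 4, 5, 5]   # out of 5 experts per question
chatgpt_correct = [4, 4, 4, 4, 4, 4, 4, 0, 4, 4]   # out of 4 ChatGPT runs per question


def percent_correct(correct_counts, respondents):
    """Return (total correct, total possible, percentage rounded to 1 decimal)."""
    total = sum(correct_counts)
    possible = respondents * len(correct_counts)
    return total, possible, round(100 * total / possible, 1)


fellows = percent_correct(fellows_correct, 8)
experts = percent_correct(experts_correct, 5)
chatgpt = percent_correct(chatgpt_correct, 4)

# Pooled clinician figure: fellows and experts combined (13 clinicians).
pooled_total = fellows[0] + experts[0]
pooled_possible = fellows[1] + experts[1]
pooled_pct = round(100 * pooled_total / pooled_possible, 1)

print(fellows, experts, chatgpt, (pooled_total, pooled_possible, pooled_pct))
```

Dividing the pooled clinician count gives 110/130, which is about 84.6%.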
Introduction
Retinopathy of prematurity (ROP) is the leading cause of preventable blindness in infants worldwide [1]. It is a disorder of immature retinal vasculature in preterm delivered babies [2]. It is a time-bound vasoproliferative disease, and important neonatal risk factors include early gestational age, low birth weight, suboptimal antenatal care, suboptimal neonatal care with supplementation of unblended unmonitored 100% oxygen, respiratory distress, apnea, poor weight gain, neonatal sepsis, and anemia [3-4]. The newer advances in pediatric retinal imaging have helped the clinician in better understanding and diagnosis of ROP [5-6]. However, the wide spectrum of clinical presentation and severity of the disease still pose a significant challenge to clinicians in the appropriate management of ROP.
Widefield imaging, teleophthalmology, and artificial intelligence (AI) have reduced the screening and diagnostic workload of ophthalmologists [5-8]. Use of AI in the real world is predominantly focused on acquiring large volumes of data to build algorithms for accurate validation of clinical diagnosis [9,10]. AI models have been used in the past for the detection and grading of high-incidence ophthalmological diseases including cataract, glaucoma, diabetic retinopathy, age-related macular degeneration, retinal vein occlusion, and ROP [11]. ChatGPT (OpenAI Inc., San Francisco, CA, USA) is an AI language model that can process text, image, and graphical inputs and reply with human-like conversational output [12]. Recent studies have explored the reliability and accuracy of ChatGPT AI in providing information, screening, and interpretation of ophthalmic diseases in various case scenarios [13-15]. ROP is one of the most unpredictable vascular diseases, posing screening, diagnosis, and management dilemmas to treating ophthalmologists. Focused use of the ChatGPT AI model in real-life ROP case scenarios, in comparison with clinician responses, has not been explored before. Hence, the objective of our prospective study is to compare the accuracy of ChatGPT AI with clinicians in real-life case scenarios related to ROP.
Materials and methods
This was a prospective cross-sectional study using a real-life case scenario-based questionnaire with multiple-response answers. The study was approved by the Institute Review Board (LEC-BHR-P-09-22-926) and adhered to the tenets of the Declaration of Helsinki. Ten real-life case scenarios pertaining to screening, diagnosis, and treatment of ROP were included. For each case scenario, three answer options were given, and only the single response that best addressed the case scenario was to be selected. Image-based ROP questions were excluded, and questions allowing more than one answer were not considered; each question had a single best response. It was a multicenter study across three states of southern India. All 10 real-life case scenario questions were given to eight vitreoretinal fellowship trainees with less than two years of experience in the screening and management of ROP. The questions were then given to five vitreoretinal consultant ROP experts with more than three years of experience in the screening and management of ROP. Informed consent was taken from all 13 clinicians participating in the study. The responses from all 13 clinicians were noted for each question on May 24, 2024. The same set of questions was later given to ChatGPT versions 3.5 and 4.0 on May 29, 2024. The ChatGPT exercise was repeated for both versions 3.5 and 4.0 more than a month apart on July 18, 2024, to check the consistency of the majority AI response. The accuracy of ChatGPT in comparison to clinician response was analyzed. For each real-life case scenario, the majority of clinician responses were compared with the majority of AI responses for agreement.
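
The majority comparison described above can be sketched in a few lines of Python. This is an illustration only, not the authors' procedure: the option labels "A", "B", and "C" and the `majority_response` helper are hypothetical, and the example vote split (11 of 13 clinicians choosing the same option) is chosen to resemble a typical question.

```python
from collections import Counter


def majority_response(responses):
    """Return the most common answer and the share of respondents who chose it.

    Ties are not handled specially; the first answer to reach the top count wins.
    """
    counts = Counter(responses)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(responses)


# Hypothetical labels for one three-option case scenario.
clinicians = ["A"] * 11 + ["B"] * 2      # 13 clinicians
chatgpt_runs = ["A", "A", "A", "A"]      # 2 versions x 2 dates

clin_answer, clin_share = majority_response(clinicians)
ai_answer, ai_share = majority_response(chatgpt_runs)
agrees = clin_answer == ai_answer

print(clin_answer, round(clin_share * 100, 1), ai_answer, agrees)
```

With 13 clinicians a two-way tie is impossible, but a three-way split or an even split among the four ChatGPT runs would need an explicit tie rule.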
Below are a few examples of case scenarios with ChatGPT responses:
Example 1 (prompt)
Case Scenario
A preterm baby was delivered at 26 weeks of gestational age with a birth weight of 1,200 grams and postmenstrual age of 30 weeks. Both eyes showed aggressive ROP with dilated and tortuous vessels in half zone with incomplete foveal vascularization.
The answer must be only one of the following:
- Both eyes are treated with an intravitreal anti-vascular endothelial growth factor (VEGF) agent
- Both eyes are treated with a laser indirect ophthalmoscope
- No treatment is required.
ChatGPT Response
Aggressive ROP, particularly with the described vascular abnormalities, is at high risk for leading to severe outcomes, including retinal detachment. The treatment options typically include the following:
Intravitreal anti-VEGF agent: This treatment can be particularly effective in aggressive cases of ROP where rapid action may be needed to prevent progression. Anti-VEGF agents can cause regression of neovascularization and decrease vascular activity.
Laser indirect ophthalmoscope: This is a more traditional treatment that involves using a laser to ablate peripheral areas of the retina that are not fully vascularized. This treatment prevents further progression of the disease by reducing the drive for abnormal vessel growth.
Given the specifics of this scenario with aggressive ROP showing dilated and tortuous vessels and incomplete foveal vascularization, and considering the need for rapid and effective intervention:
- Both eyes are treated with an intravitreal anti-VEGF agent.
This option is chosen because anti-VEGF treatments are particularly effective for aggressive forms of ROP, as they can rapidly reduce the activity of the disease, helping in stabilizing the condition quicker than laser treatment, which is crucial given the severity described.
Example 2 (prompt)
Case Scenario
A preterm baby was delivered at 28 weeks of gestational age with a birth weight of 1,400 grams and a postmenstrual age of 40 weeks. Both eyes showed a well-dilated pupil with zone 2 stage 4B tractional retinal detachment.
- Both eyes are treated with an intravitreal anti-VEGF agent alone.
- Both eyes are treated with a laser indirect ophthalmoscope alone.
- Both eyes immediate lens sparing vitrectomy with endo laser.
ChatGPT Response
Zone 2, stage 4B tractional retinal detachment indicates that the retina has already begun detaching due to the traction from fibrovascular membranes. This is a severe and advanced stage of ROP, which requires prompt and effective surgical treatment to attempt reattachment of the retina and prevent permanent vision loss. Immediate lens-sparing vitrectomy with endo laser involves the surgical removal of the vitreous that is tugging on the retina and the application of laser to the retina to secure it in place. This approach is suitable for managing severe tractional detachments such as stage 4B, as it directly addresses the mechanical aspect of the detachment.
Given these considerations, the correct treatment approach for this scenario would be:
- Both eyes immediate lens sparing vitrectomy with endo laser.
Example 3 (prompt)
Case Scenario
A preterm baby was delivered at 26 weeks of gestational age with a birth weight of 1,400 grams and a postmenstrual age of 30 weeks. Both eyes fundus showed aggressive ROP in zone 1. What is the dose of intravitreal bevacizumab injected?
The answer must be only one of the following:
- 1.25 mg in 0.05 mL
- 0.625 mg in 0.025 mL
- 1.25 mg in 0.025 mL
ChatGPT Response
For the treatment of aggressive ROP in zone 1 using intravitreal bevacizumab, the commonly administered dose according to established guidelines and clinical studies (such as the BEAT-ROP study) is:
- 0.625 mg in 0.025 mL per eye.
Results
Statistical analysis
SAS (Statistical Analysis System, Version 3.81, Enterprise Edition, SAS Institute Inc., Cary, NC, USA) software was used for the analysis. Intra-observer and inter-observer agreement was assessed using Cohen’s kappa. The consistency of ChatGPT responses for both versions 3.5 and 4.0 is shown in Table 1. Cohen's kappa was 1.0, which indicates perfect agreement between the responses for both versions of ChatGPT (3.5 and 4.0) from May 29 and July 18. The chi-square test p-value was 0.0855, which is greater than the common significance level of 0.05, indicating that there is no significant difference in the distribution of responses between the two dates.
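
For readers unfamiliar with the statistic, the sketch below shows the standard two-rater Cohen's kappa, (p_o - p_e) / (1 - p_e), in plain Python. It is not the SAS procedure used in the study. The `cohens_kappa` function name and the shorthand response labels are assumptions made for illustration; the labels abbreviate the responses listed in Table 1, which were identical on both dates.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Two-rater Cohen's kappa: (p_observed - p_expected) / (1 - p_expected)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of ratings")
    n = len(rater_a)
    # Observed agreement: share of items on which the two ratings match.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions per label.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Shorthand labels (illustrative) for the ten ChatGPT responses on each date.
may_29 = ["screen_now", "screen_1_month", "treat_now", "observe_2_weeks",
          "observe_1_week", "anti_vegf", "vitrectomy", "zone_2_statement",
          "dose_option", "treat_now"]
july_18 = list(may_29)  # responses were unchanged on the second date

kappa_retest = cohens_kappa(may_29, july_18)

# A small contrasting example with one disagreement out of four items.
kappa_partial = cohens_kappa(["y", "y", "n", "n"], ["y", "n", "n", "n"])

print(kappa_retest, kappa_partial)
```

Identical response lists give a kappa of 1.0, while the contrasting four-item example gives 0.5, showing how chance agreement pulls the statistic below the raw 75% match rate.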
Cohen’s kappa agreement between the majority of clinician responses and the majority of ChatGPT AI-generated responses for each question is shown in Table 2. ChatGPT answered nine questions correctly (90%), outperforming the fellowship trainees (77.5%, i.e., 62 correct responses out of 80). The accuracy of ROP experts was highest at 96% (i.e., 48 correct responses out of 50). There was substantial agreement between the majority of responses of clinicians and the ChatGPT response, with an average Cohen’s kappa of 0.80.
ChatGPT answered only one question incorrectly; that case scenario is described below.
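
The label "substantial" matches the widely used Landis and Koch bands for kappa (0.61 to 0.80 substantial, 0.81 to 1.00 almost perfect). The paper does not state which scale it applied, so the mapping below is an assumption, and `interpret_kappa` is a hypothetical helper written for illustration.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the Landis and Koch (1977) descriptive band."""
    if kappa < 0:
        return "poor"
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper_bound, label in bands:
        if kappa <= upper_bound:
            return label
    raise ValueError("kappa cannot exceed 1")


# Per-question and overall kappa values as reported in Table 2.
reported = {"Q1": 0.82, "Q2": 0.54, "Q8": 0.08, "Q9": 0.74, "Overall": 0.80}
labels = {name: interpret_kappa(value) for name, value in reported.items()}

print(labels)
```

On this scale the reported overall value of 0.80 sits at the upper edge of the substantial band, while question 8 falls in the slight band.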
Prompt
Key components of International Classification of Retinopathy of Prematurity (ICROP), 3rd edition, include all the below except:
The answer must be only one of the following:
- Posterior zone 2 begins at the margin between zone 1 and extends into zone 2 for 2-disc diameters.
- The term notch is used to describe an incursion by ROP lesion of 1-2 clock hours into a more posterior zone.
- The term aggressive posterior retinopathy of prematurity (APROP) is used to describe severe ROP.
ChatGPT response
In the ICROP, 3rd edition, various terms and zones are specifically defined to guide the diagnosis and treatment of ROP. Understanding these definitions is crucial for the correct management of the disease. Let's look at the statements provided:
"Posterior zone 2 begins at the margin between zone 1 and extends into zone 2 for 2-disc diameters." This statement is incorrect as per the ICROP definitions. In the classification, zone 2 starts at the edge of zone 1 (defined as twice the distance from the optic disc to the macula, circular in shape) and extends outwards to the nasal ora serrata. It does not describe an extension "into zone 2 for 2-disc diameters."
"The term notch is used to describe an incursion by ROP lesion of 1-2 clock hours into a more posterior zone." This statement is accurate. The term "notch" in the context of ROP is used to describe the localized advancement of the disease into a more posterior zone, typically involving 1-2 clock hours.
"The term aggressive posterior retinopathy of prematurity (APROP) is used to describe severe ROP." This statement is also accurate. APROP is a term used in ROP classification to describe a particularly severe, rapidly progressing form of the disease that is characterized by its posterior location and poor differentiation between vascularized and avascular retina.
Given this analysis, the answer to the query about which statement is not included in the third edition of the ICROP is:
- Posterior zone 2 begins at the margin between zone 1 and extends into zone 2 for 2-disc diameters.
The ICROP, 3rd edition, has replaced the term APROP with A-ROP (aggressive retinopathy of prematurity), so the correct response is the third option. ChatGPT did not identify this correctly.
Discussion
Previous studies have used ChatGPT responses in real-life case scenarios in diabetic retinopathy screening, interpretation of clinical ophthalmological images, and diagnosis of neuro-ophthalmology cases [14-16]. In our study, we have compared the accuracy of responses between ChatGPT versus vitreoretinal fellowship trainees versus ROP experts in various real-life case scenarios pertaining to ROP. We noticed that the ChatGPT response was more accurate when the type of screening and treatment guidelines were mentioned in the case scenario. An example is given below.
Prompt
A preterm baby was delivered at 30 weeks of gestational age with a birth weight of 1,500 grams and a postmenstrual age of 34 weeks. Both eyes showed ROP with a severity of zone 2, stage 3, plus disease.
According to the Early Treatment for Retinopathy of Prematurity (ETROP) study, the answer must be only one of the following:
- Treatment must be done immediately.
- Observe and review in two weeks.
- No treatment is required.
ChatGPT response
For a preterm baby born at 30 weeks of gestational age, weighing 1,500 grams, and now at a postmenstrual age of 34 weeks, with both eyes showing retinopathy of prematurity (ROP) at zone 2, stage 3, plus disease, the treatment guidelines are specific. According to the Early Treatment for Retinopathy of Prematurity (ETROP) study, this condition requires immediate intervention.
Zone 2, stage 3, plus disease is considered severe enough to warrant urgent treatment because "plus disease" indicates progressive vascular dilation and tortuosity associated with worsening disease, which significantly increases the risk of poor outcomes, including retinal detachment.
Therefore, the correct response based on the scenario provided would be:
- Treatment must be done immediately.
This immediate treatment typically involves laser therapy or anti-VEGF injections to prevent the progression of the disease and potentially serious complications.
Similarly, the ChatGPT response was more accurate when the Indian ROP guidelines for screening or the ETROP guidelines for treatment were mentioned in the questionnaire.
The answers from ChatGPT version 4.0 were considerably more detailed than those from version 3.5. However, both versions were consistent, with perfect agreement when checked on two different dates about seven weeks apart (May 29 and July 18). ChatGPT outperformed the vitreoretinal trainees and had substantial agreement with the majority of clinician responses. ChatGPT AI can assist general ophthalmologists and ROP trainees in screening babies for ROP and referring them to higher-level treatment centers when necessary.
The strengths of the study are that the study questionnaire and case scenarios were constructed and validated by two ROP experts with more than five years of experience. The study was conducted among ROP specialists, including trainees, to understand the need for and utility of the ChatGPT tool in real-world scenarios. The study validated both the free and paid versions of the ChatGPT tool. The limitations of the study were the small sample size and that there were no ROP image-based questionnaires and hence we could not compare ChatGPT performance with clinicians in image-based case scenarios. ChatGPT AI is a language model and not yet robust in giving analysis for complex images of ROP. Hence, we have used ChatGPT responses only for real-life case scenario-based questionnaires in ROP. However, future versions of ChatGPT AI models with robust image analysis will be much more helpful in assisting general ophthalmologists and ROP trainees in screening babies for ROP.
Summary
What Was Known Before
AI models have been used in the past for the detection and grading of high-incidence ophthalmological diseases including cataract, glaucoma, diabetic retinopathy, age-related macular degeneration, retinal vein occlusion, and ROP. ChatGPT is an AI language model that can process text, image, and graphical inputs and reply with human-like conversational output.
What This Study Adds
The study was conducted among ROP specialists, including trainees, to understand the need for and utility of the ChatGPT tool in real-world scenarios. The ChatGPT AI model showed substantial agreement with the majority of clinician responses. ChatGPT AI presents promising new software tools that can be explored further for use in real-life case scenarios in ROP. A more accurate prompt mentioning the type of screening guidelines can promote more accurate answers by ChatGPT as per the requested guidelines.
Conclusions
The ChatGPT AI model showed substantial agreement with the majority of clinician responses and performed better than vitreoretinal fellowship trainees. The ChatGPT responses are consistent with perfect agreement between the two versions (3.5 and 4.0) on different dates. A more accurate prompt mentioning the type of screening guidelines can promote more accurate answers by ChatGPT as per the requested guidelines. ChatGPT AI presents a promising new software tool that can be explored further for use in real-life case scenarios in ROP.
References
- 1. Solebo AL, Teoh L, Rahi J. Epidemiology of blindness in children. Arch Dis Child. 2017;102:853-857. doi:10.1136/archdischild-2016-310532. PMID: 28465303.
- 2. Ashton N, Ward B, Serpell G. Effect of oxygen on developing retinal vessels with particular reference to the problem of retrolental fibroplasia. Br J Ophthalmol. 1954;38:397-432. doi:10.1136/bjo.38.7.397. PMID: 13172417. PMCID: PMC1324374.
- 3. Alajbegovic-Halimic J, Zvizdic D, Alimanovic-Halilovic E, Dodik I, Duvnjak S. Risk factors for retinopathy of prematurity in premature born children. Med Arch. 2015;69:409-413. doi:10.5455/medarh.2015.69.409-413. PMID: 26843736. PMCID: PMC4720470.
- 4. Kim SJ, Port AD, Swan R, Campbell JP, Chan RV, Chiang MF. Retinopathy of prematurity: a review of risk factors and their clinical significance. Surv Ophthalmol. 2018;63:618-637. doi:10.1016/j.survophthal.2018.04.002. PMID: 29679617. PMCID: PMC6089661.
- 5. Belenje A, Reddy RU, Optom B, Agarwal K, Parmeswarappa DC, Jalali S. Non-contact widefield neonatal retinal imaging for retinopathy of prematurity using the Clarus 700 high resolution true colour reflectance imaging. Eye (Lond). 2023;37:1904-1909. doi:10.1038/s41433-022-02273-2. PMID: 36195674. PMCID: PMC10275887.
- 6. Belenje A, Reddy RU, Parmeswarappa DC, Padhi TR, Subbarao B, Jalali S. Evaluation of optical coherence tomography biomarkers to differentiate favourable and unfavourable responders to intravitreal anti-vascular endothelial growth factor treatment in retinopathy of prematurity. Eye (Lond). 2024;38:1097-1103. doi:10.1038/s41433-023-02824-1. PMID: 37968517. PMCID: PMC11009246.
- 7. Ramanathan A, Athikarisamy SE, Lam GC. Artificial intelligence for the diagnosis of retinopathy of prematurity: a systematic review of current algorithms. Eye (Lond). 2023;37:2518-2526. doi:10.1038/s41433-022-02366-y. PMID: 36577806. PMCID: PMC10397194.
- 8. Padhi TR, Bhunia S, Das T. Outcome of real-time telescreening for retinopathy of prematurity using videoconferencing in a community setting in Eastern India. Indian J Ophthalmol. 2024;72:697-703. doi:10.4103/IJO.IJO_2024_23. PMID: 38389241. PMCID: PMC11168531.
