Evaluating ChatGPT's Performance in Classifying Pneumonia from Chest X-Ray Images
Pragna Prahallad, Pranathi Prahallad

TL;DR
This study assesses ChatGPT's ability to classify chest X-ray images as NORMAL or PNEUMONIA in a zero-shot setting, without any fine-tuning. It finds moderate accuracy (74% at best) and emphasizes that further development is needed before clinical use.
Contribution
It demonstrates ChatGPT's potential in medical image classification and evaluates the impact of different prompt designs on performance.
Findings
Concise, feature-focused prompts achieved the highest accuracy, 74%.
Reasoning-oriented prompts performed worse than concise ones.
ChatGPT shows emerging potential but limited reliability.
Abstract
In this study, we evaluate the ability of OpenAI's gpt-4o model to classify chest X-ray images as either NORMAL or PNEUMONIA in a zero-shot setting, without any prior fine-tuning. A balanced test set of 400 images (200 from each class) was used to assess performance across four distinct prompt designs, ranging from minimal instructions to detailed, reasoning-based prompts. The results indicate that concise, feature-focused prompts achieved the highest classification accuracy of 74%, whereas reasoning-oriented prompts resulted in lower performance. These findings highlight that while ChatGPT exhibits emerging potential for medical image interpretation, its diagnostic reliability remains limited. Continued advances in visual reasoning and domain-specific adaptation are required before such models can be safely applied in clinical practice.
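The evaluation described in the abstract, scoring several prompt designs against the same labeled test set, can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the prompt texts, the `classify` callable (a stand-in for whatever sends a prompt and an image to the model and returns its reply), and the reply-parsing rule are all assumptions introduced here.

```python
from typing import Callable, Dict, List, Tuple

LABELS = ("NORMAL", "PNEUMONIA")


def parse_label(reply: str) -> str:
    """Map a free-text reply to one label, or 'UNKNOWN' if ambiguous.

    Assumption: simple substring matching. This is crude; for example,
    a reply containing 'abnormal' would also match 'NORMAL'.
    """
    text = reply.upper()
    hits = [label for label in LABELS if label in text]
    return hits[0] if len(hits) == 1 else "UNKNOWN"


def accuracy_per_prompt(
    samples: List[Tuple[str, str]],       # (image_path, true_label) pairs
    prompts: Dict[str, str],              # prompt name -> prompt text
    classify: Callable[[str, str], str],  # (prompt, image_path) -> model reply
) -> Dict[str, float]:
    """Return the fraction of correctly classified samples for each prompt."""
    results = {}
    for name, prompt in prompts.items():
        correct = sum(
            parse_label(classify(prompt, path)) == truth
            for path, truth in samples
        )
        results[name] = correct / len(samples)
    return results


if __name__ == "__main__":
    # Hypothetical prompts; the paper's actual four prompts are not shown here.
    prompts = {
        "minimal": "Classify this chest X-ray as NORMAL or PNEUMONIA.",
        "reasoning": "Describe the findings step by step, then give a label.",
    }
    # Tiny balanced toy set with hypothetical file names.
    samples = [("n1.png", "NORMAL"), ("n2.png", "NORMAL"),
               ("p1.png", "PNEUMONIA"), ("p2.png", "PNEUMONIA")]

    # Stub classifier that always answers PNEUMONIA, in place of a real model.
    def always_pneumonia(prompt: str, image_path: str) -> str:
        return "PNEUMONIA"

    print(accuracy_per_prompt(samples, prompts, always_pneumonia))
```

On a balanced set, a classifier that always predicts one class scores 50%, which is the baseline against which the reported 74% (about 296 of 400 images) should be read. Counting unparseable replies as errors, as this sketch does, is one possible design choice; the source does not say how such replies were handled.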
