Exploring the Boundaries of GPT-4 in Radiology

Qianchu Liu; Stephanie Hyland; Shruthi Bannur; Kenza Bouzid; Daniel C. Castro; Maria Teodora Wetscherek; Robert Tinn; Harshita Sharma; Fernando Pérez-García; Anton Schwaighofer; Pranav Rajpurkar; Sameer Tajdin Khanna; Hoifung Poon; Naoto Usuyama; Anja Thieme; Aditya V. Nori; Matthew P. Lungren; Ozan Oktay; Javier Alvarez-Valle

arXiv:2310.14573·cs.CL·October 24, 2023·2 cites



TL;DR

This paper evaluates GPT-4's performance on radiology report tasks, showing it often surpasses or matches specialized models, with strong zero-shot results and potential for clinical application.

Contribution

It demonstrates GPT-4's competitive performance in radiology text tasks, highlighting its zero-shot capabilities and potential as a generalist model in medical NLP.

Findings

01 With zero-shot prompting, GPT-4 outperforms SOTA radiology models on temporal sentence similarity classification and natural language inference.

02 With example-based prompting, GPT-4 matches supervised models on findings summarisation.

03 Error analysis shows GPT-4 has sufficient radiology knowledge, with only occasional errors in cases that require nuanced domain knowledge.

Abstract

The recent success of general-domain large language models (LLMs) has significantly changed the natural language processing paradigm towards a unified foundation model across domains and applications. In this paper, we focus on assessing the performance of GPT-4, the most capable LLM so far, on text-based applications for radiology reports, comparing against state-of-the-art (SOTA) radiology-specific models. Exploring various prompting strategies, we evaluated GPT-4 on a diverse range of common radiology tasks and found that GPT-4 either outperforms or is on par with current SOTA radiology models. With zero-shot prompting, GPT-4 already obtains substantial gains ($\approx 10\%$ absolute improvement) over radiology models in temporal sentence similarity classification (accuracy) and natural language inference ($F_1$). For tasks that require learning dataset-specific style or schema…
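To make the evaluation terms in the abstract concrete, here is a minimal Python sketch of what a zero-shot NLI prompt and the two reported metrics might look like. Everything in it is an illustrative assumption rather than the authors' setup: the prompt wording, the label set, the helper names (`build_zero_shot_prompt`, `accuracy`, `macro_f1`, `absolute_gain`), and the choice of macro-averaged $F_1$ (the abstract only says $F_1$). No model is called; the sketch only shows how predictions would be scored.

```python
# Hypothetical label set for a three-way NLI task (an assumption).
LABELS = ("entailment", "neutral", "contradiction")


def build_zero_shot_prompt(premise, hypothesis):
    """Build a zero-shot prompt: task instruction only, no worked examples.

    The wording is illustrative and is not taken from the paper.
    """
    return (
        "Decide whether the hypothesis follows from the premise.\n"
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        f"Answer with one of: {', '.join(LABELS)}."
    )


def accuracy(gold, pred):
    """Fraction of predictions that exactly match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-label F1 scores (macro averaging assumed)."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def absolute_gain(new_score, baseline_score):
    """Absolute improvement: a plain difference of scores, not a ratio."""
    return new_score - baseline_score
```

Under this reading, an "absolute improvement" of about 10% means the score itself rises by roughly 0.10 (for example from 0.70 to 0.80, made-up numbers), which is different from a 10% relative increase over the baseline.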


Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Artificial Intelligence in Healthcare and Education

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Position-Wise Feed-Forward Layer · Dense Connections · Absolute Position Encodings · Adam · Label Smoothing · Residual Connection