Challenges of GPT-3-based Conversational Agents for Healthcare
Fabian Lechner, Allison Lahnala, Charles Welch, Lucie Flek

TL;DR
This paper examines the limitations and risks of using GPT-3-based models in medical question-answering systems, highlighting issues like inaccurate responses and unsafe recommendations through stress-testing procedures.
Contribution
It introduces a manual stress-testing procedure to evaluate GPT-3's performance in high-risk medical queries and analyzes the potential safety and accuracy concerns.
Findings
LLMs generate erroneous medical information
LLMs produce unsafe recommendations
Content may be offensive or inappropriate
Abstract
The potential to provide patients with faster information access while allowing medical specialists to concentrate on critical tasks makes medical-domain dialog agents appealing. However, the integration of large language models (LLMs) into these agents presents certain limitations that may result in serious consequences. This paper investigates the challenges and risks of using GPT-3-based models for medical question-answering (MedQA). We perform several evaluations contextualized in terms of standard medical principles. We provide a procedure for manually designing patient queries to stress-test high-risk limitations of LLMs in MedQA systems. Our analysis reveals that LLMs fail to respond adequately to these queries, generating erroneous medical information, unsafe recommendations, and content that may be considered offensive.
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · AI in Service Interactions
Methods: fail
