Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models
Sunday Oyinlola Ogundoyin, Muhammad Ikram, Rahat Masood

TL;DR
This study assesses the safety, factual accuracy, and policy compliance of 6,233 web-deployed custom medical GPTs (MedGPTs), evaluating a stratified sample of 1,500 alongside 10 open-source LLMs, and reveals systemic risks and gaps in safeguards.
Contribution
It introduces two evaluation frameworks, MedGPT-HEval for hallucination detection and an LLM-based pipeline for assessing policy violations and developer intent, and releases a dataset to support future safety research on medical LLMs.
Findings
25-30% of MedGPTs have low factual accuracy
33.6-54.3% violate operational thresholds
57.06% of Action-enabled models lack adequate privacy disclosures
Abstract
Medical large language models (LLMs), including custom medical GPTs (MedGPTs) and open-source models, are increasingly deployed on web platforms to provide clinical guidance. However, they pose risks of hallucination, policy noncompliance, and unsafe design. We conduct a large-scale assessment of 6,233 MedGPTs, evaluating a stratified sample of 1,500, together with 10 open-source LLMs. We introduce two frameworks: MedGPT-HEval for hallucination detection and an LLM-based pipeline for assessing policy violations and developer intent. Our results show that 25-30% of MedGPTs exhibit low factual accuracy, with bottom- and middle-tier models at highest risk; 33.6-54.3% violate operational thresholds, and 57.06% of Action-enabled models lack adequate privacy disclosures. Compared with open-source models, MedGPTs achieve higher factual accuracy and semantic alignment, though open-source models…
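The abstract mentions drawing a stratified sample of 1,500 MedGPTs from a population of 6,233 but does not say how the strata were defined or sized. The following is a minimal sketch of one common approach, proportional allocation with largest-remainder rounding, and is not the authors' procedure. The tier names and tier sizes are hypothetical placeholders chosen only so that they sum to 6,233, and the helper names `proportional_allocation` and `stratified_sample` are illustrative.

```python
import random


def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size across strata in proportion to their sizes."""
    total = sum(stratum_sizes.values())
    exact = {k: sample_size * n / total for k, n in stratum_sizes.items()}
    alloc = {k: int(v) for k, v in exact.items()}  # floor of each share
    leftover = sample_size - sum(alloc.values())
    # Give the remaining slots to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


def stratified_sample(items_by_stratum, sample_size, seed=0):
    """Draw a proportional stratified sample without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sizes = {k: len(v) for k, v in items_by_stratum.items()}
    alloc = proportional_allocation(sizes, sample_size)
    return {k: rng.sample(items_by_stratum[k], alloc[k]) for k in items_by_stratum}


# Hypothetical tier sizes (assumption: NOT taken from the paper).
tier_sizes = {"top": 1000, "middle": 3000, "bottom": 2233}
population = {
    tier: [f"{tier}-{i}" for i in range(n)] for tier, n in tier_sizes.items()
}
sample = stratified_sample(population, 1500)
```

Under this scheme each tier contributes to the sample roughly in proportion to its share of the population, so per-tier statistics such as the low-accuracy rates reported above are not dominated by whichever tier happens to be largest in a simple random draw.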
