Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models

Sunday Oyinlola Ogundoyin; Muhammad Ikram; Rahat Masood

arXiv:2605.20591·cs.CL·May 21, 2026

TL;DR

This study assesses the safety, factual accuracy, and policy compliance of 6,233 web-deployed custom medical GPTs (MedGPTs), evaluating a stratified sample of 1,500 alongside 10 open-source LLMs, and reveals systemic risks and gaps in safeguards.

Contribution

It introduces two evaluation frameworks, MedGPT-HEval for hallucination detection and an LLM-based pipeline for assessing policy violations and developer intent, and releases a dataset to support future safety research on medical LLMs.

Findings

01. 25-30% of MedGPTs exhibit low factual accuracy

02. 33.6-54.3% of MedGPTs violate operational thresholds

03. 57.06% of Action-enabled MedGPTs lack adequate privacy disclosures

Abstract

Medical large language models (LLMs), including custom medical GPTs (MedGPTs) and open-source models, are increasingly deployed on web platforms to provide clinical guidance. However, they pose risks of hallucination, policy noncompliance, and unsafe design. We conduct a large-scale assessment of 6,233 MedGPTs, evaluating a stratified sample of 1,500, together with 10 open-source LLMs. We introduce two frameworks: MedGPT-HEval for hallucination detection and an LLM-based pipeline for assessing policy violations and developer intent. Our results show that 25-30% of MedGPTs exhibit low factual accuracy, with bottom- and middle-tier models at highest risk; 33.6-54.3% violate operational thresholds, and 57.06% of Action-enabled models lack adequate privacy disclosures. Compared with open-source models, MedGPTs achieve higher factual accuracy and semantic alignment, though open-source models…
