# Epistemic and ethical limits of large language models in evidence-based medicine: from knowledge to judgment

**Authors:** Wenxiu Qi, Longfei Pan

PMC · DOI: 10.3389/fdgth.2025.1706383 · Frontiers in Digital Health · 2026-01-20

## TL;DR

This paper tests three general-purpose LLMs (ChatGPT, DeepSeek, and Gemini) on core evidence-based medicine tasks and finds that they produce plausible answers but fall short on numerical accuracy, source verifiability, and clinical accountability.

## Contribution

The study combines an empirical evaluation of general-purpose LLMs on evidence-based medicine tasks with a philosophical analysis of their limitations.

## Key findings

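- The evaluated LLMs produce syntactically coherent and medically plausible outputs in evidence tasks, but those outputs are not reliably accurate.
- Models struggle with numerical extraction and processing, and their cited sources are hard to verify.
- The philosophical analysis identifies three risks: disembodiment, deinstitutionalization, and depragmatization, reflecting the models' lack of clinical engagement, institutional norms, and responsibility for reasoning.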

## Abstract

The rapid evolution of general large language models (LLMs) provides a promising framework for integrating artificial intelligence into medical practice. While these models can generate medically relevant language, applying them to evidence inference in clinical scenarios may pose challenges. This study uses empirical experiments to analyze the capability boundaries of current general-purpose LLMs in evidence-based medicine (EBM) tasks and offers a philosophical reflection on their limitations.

This study evaluates the performance of three general-purpose LLMs (ChatGPT, DeepSeek, and Gemini) when directly applied to core tasks of EBM. The models were tested in a baseline, unassisted setting, without task-specific fine-tuning, external evidence retrieval, or embedded prompting frameworks. Two clinical scenarios, namely SGLT2 inhibitors for heart failure and PD-1/PD-L1 inhibitors for advanced non-small cell lung cancer (NSCLC), were used to assess performance in evidence generation, evidence synthesis, and clinical judgment. Model outputs were evaluated using a multidimensional rubric. The empirical results were analyzed from an epistemological perspective.

Experiments show that the evaluated general-purpose LLMs can produce syntactically coherent and medically plausible outputs in core evidence-related tasks. However, under current architectures and baseline deployment conditions, several limitations remain: imperfect accuracy in numerical extraction and processing, limited verifiability of cited sources, inconsistent methodological rigor in synthesis, and weak attribution of clinical responsibility in recommendations. Building on these empirical patterns, the philosophical analysis identifies three potential risks in this testing setting: disembodiment, deinstitutionalization, and depragmatization.

This study suggests that directly applying general-purpose LLMs to clinical evidence tasks runs into epistemic and ethical limitations. Under current architectures, these systems lack embodied engagement with clinical phenomena, do not participate in institutional evaluative norms, and cannot assume responsibility for reasoning. These findings offer direction for future medical AI: grounding outputs in real-world data, integrating deployment into clinical workflows with oversight, and designing human-AI collaboration with clear lines of responsibility.

## Linked entities

- **Diseases:** heart failure (MONDO:0005252)

## Full-text entities

- **Genes:** CD274 (CD274 molecule) [NCBI Gene 29126] {aka ADMIO5, B7-H, B7H1, PD-L1, PDCD1L1, PDCD1LG1}, PDCD1 (programmed cell death 1) [NCBI Gene 5133] {aka ADMIO4, AIMTBS, CD279, PD-1, PD1, SLEB2}
- **Diseases:** heart failure (MESH:D006333)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12864482/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12864482/full.md

## References

35 references — full list in the complete paper: https://tomesphere.com/paper/PMC12864482/full.md

---
Source: https://tomesphere.com/paper/PMC12864482