# Auditing frontier general-purpose large language models in biomedical tasks: reasoning gains, extraction limits, and benchmark reliability

**Authors:** Yu Hou, Zaifu Zhan, Min Zeng, Yifan Wu, Shuang Zhou, Xiaoyi Chen, Huixue Zhou, Meijia Song, Rui Zhang

PMC · DOI: 10.21203/rs.3.rs-8605899/v1 · Research Square · 2026-02-18

## TL;DR

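This paper audits frontier general-purpose large language models on biomedical tasks. It finds gains in clinical reasoning and multimodal QA, persistent limitations in format-constrained tasks such as span-level extraction, and benchmark annotations that are outdated or ambiguous often enough to misestimate model capability.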

## Contribution

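A unified, reproducible audit of frontier general-purpose language models on representative biomedical text-mining tasks and nine biomedical question-answering benchmarks, showing gains in reasoning and cost-effectiveness alongside limits in structured extraction and in the benchmarks themselves.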

## Key findings

- Frontier models show gains in clinical reasoning and multimodal QA but struggle with structured extraction tasks.
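- Blinded expert adjudication attributes a substantial fraction of apparent model errors to outdated or ambiguous benchmark annotations, so current benchmarks may misestimate model capability.
- Cost-normalized analysis shows that recent frontier models achieve higher accuracy at substantially lower cost per correct answer, which shifts practical deployment trade-offs.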

## Abstract

As large language models approach clinical deployment, their deployment-relevant reliability and the validity of the benchmarks used to assess it remain insufficiently examined. Here, we present a unified, reproducible, and human-centric audit of frontier general-purpose language models using representative biomedical text-mining tasks and nine biomedical question-answering benchmarks spanning reasoning-intensive, extraction-oriented, and multimodal settings. We observe consistent gains in clinical reasoning and multimodal biomedical QA; however, limitations in format-constrained tasks such as span-level extraction and evidence-dense summarization pose challenges for integration into structured clinical workflows, despite narrowing gaps with supervised systems. Blinded expert adjudication confirms more coherent and clinically plausible reasoning and further reveals that a substantial fraction of apparent errors arises from outdated or ambiguous benchmark annotations, suggesting that current benchmarks may misestimate model capability and potentially misguide deployment decisions. Cost-normalized analyses demonstrate that recent frontier models achieve higher accuracy at substantially lower cost per correct answer, reshaping practical deployment trade-offs for scalable digital medicine systems. Together, these findings suggest that general-purpose language models are approaching deployment-relevant reliability; however, safe and effective clinical use will require hybrid architectures, external grounding, and human-in-the-loop evaluation and expert oversight.
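The abstract's "cost per correct answer" can be illustrated with a small calculation. The sketch below is a minimal example that assumes the metric is total inference cost divided by the number of correct answers; this summary does not give the paper's exact definition, so that formula is an assumption. The model names and all numbers are hypothetical placeholders, not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class EvalRun:
    """One model's results on one benchmark (all values hypothetical)."""
    model: str
    n_questions: int
    n_correct: int
    total_cost_usd: float


def accuracy(run: EvalRun) -> float:
    """Fraction of benchmark questions answered correctly."""
    return run.n_correct / run.n_questions


def cost_per_correct(run: EvalRun) -> float:
    """Assumed definition: total cost divided by number of correct answers."""
    if run.n_correct == 0:
        return float("inf")  # no correct answers, so the cost is unbounded
    return run.total_cost_usd / run.n_correct


# Hypothetical runs, for illustration only.
runs = [
    EvalRun("older-model", n_questions=1000, n_correct=700, total_cost_usd=35.0),
    EvalRun("newer-model", n_questions=1000, n_correct=850, total_cost_usd=8.5),
]

for run in runs:
    print(
        f"{run.model}: accuracy={accuracy(run):.2f}, "
        f"cost per correct answer=${cost_per_correct(run):.4f}"
    )
```

Normalizing by correct answers rather than by total queries penalizes a cheap model that is often wrong, which is why the two metrics are worth reporting side by side.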

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12934912/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12934912/full.md

## References

52 references — full list in the complete paper: https://tomesphere.com/paper/PMC12934912/full.md

---
Source: https://tomesphere.com/paper/PMC12934912