# Human-Machine Agreement in Medical Ethics: Patient Autonomy Case-Based Evaluation of Large Language Models

**Authors:** Vamshi Mugu, Brendan Carr, Ashish Khandelwal, Mike Olson, John Schupbach, John Zietlow, T N Diem Vu, Alex Chan, Christopher Collura, John Schmitz

PMC · DOI: 10.2196/77061 · JMIR Medical Informatics · 2025-10-23

## TL;DR

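This paper evaluates how well three foundational large language models (ChatGPT, LLaMA, and Gemini) agree with a physician panel on hypothetical medical ethics cases involving patient autonomy, and finds that iterative improvement techniques raise agreement from slight-to-fair to substantial or higher.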

## Contribution

The study introduces a novel approach to evaluate and improve foundational LLMs using patient autonomy-based hypothetical cases in medical ethics.

## Key findings

- Initial agreement between the foundational LLMs and physician consensus was slight to fair.
- Iterative improvement techniques raised agreement to substantial or higher (Cohen κ of 0.73-0.82).
- The improvement was statistically significant for all three LLMs (P=.006 for ChatGPT, P<.001 for Gemini, and P<.001 for LLaMA).

## Abstract

Medical ethics provides a moral framework for the practice of clinical medicine. Four principles, that is, beneficence, nonmaleficence, patient autonomy, and justice, form the cornerstones of medical ethics as it is practiced today. Of these 4 principles, patient autonomy holds a pivotal position and often takes precedence in ethical dilemmas that result from conflicts among the 4 principles. Its importance serves as a constant reminder to the clinician that the “needs of the patient come first.” With their remarkable ability to process natural language, large language models (LLMs) have recently pervaded nearly every aspect of human life, including medicine and medical ethics. Reliance on tools such as LLMs, however, poses fundamental questions in medical ethics, where human-like reasoning, emotional intelligence, and an understanding of local context and values are of utmost importance.

While emphasizing the central role of the human factor, we undertake a bold venture to establish some confidence in LLMs as they pertain to medical ethics, not only by evaluating the status quo of foundational LLMs but also by exploring ways to improve them using patient autonomy–based hypothetical cases. Although the literature today is certainly lacking in such ventures, we also believe projects such as ours must be frequently revisited, because the field of LLMs is evolving at a pace that is both rapid and unprecedented.

We evaluated 3 foundational LLMs (ChatGPT, LLaMA, and Gemini) on hypothetical cases in patient autonomy. We used Cohen κ to compare LLM responses to the consensus from a physician panel. The McNemar test was used during the improvement phase and to report the final significance of improved agreement of each LLM with physician consensus. P values less than .05 were considered significant. An agreement with κ<0 was designated as poor, 0-0.2 as slight, 0.21-0.4 as fair, 0.41-0.6 as moderate, 0.61-0.8 as substantial, and 0.81-1 as almost perfect.
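The agreement measure described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `cohen_kappa` and `agreement_band` are hypothetical, the example answers are invented toy data, and treating each band's upper bound as inclusive is an assumption about how the published cutoffs were applied.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where the two labels match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


def agreement_band(kappa):
    """Map kappa to the descriptive bands listed in the abstract.

    Assumption: upper bounds are inclusive (e.g. exactly 0.2 counts as slight).
    """
    if kappa < 0:
        return "poor"
    if kappa <= 0.2:
        return "slight"
    if kappa <= 0.4:
        return "fair"
    if kappa <= 0.6:
        return "moderate"
    if kappa <= 0.8:
        return "substantial"
    return "almost perfect"


# Hypothetical toy data: 1 = "respect the patient's choice", 0 = "override it".
physician_consensus = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
llm_answers = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

kappa = cohen_kappa(physician_consensus, llm_answers)
print(f"kappa = {kappa:.2f} ({agreement_band(kappa)})")
```

Because κ subtracts the agreement expected by chance, an LLM that matches the panel on most cases can still score low when one answer dominates both sets of responses, which is why the bands are applied to κ and not to raw percent agreement.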

There was slight to fair agreement between the foundational LLMs and the physician consensus. With iterative improvement techniques, this agreement evolved to be substantial or higher (Cohen κ of 0.73-0.82). The degree of improvement was statistically significant (P=.006 for ChatGPT, P<.001 for Gemini, and P<.001 for LLaMA).
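The significance values above come from the McNemar test, which compares paired outcomes on the same cases. The following is a rough sketch of the exact (binomial) form of that test; the abstract does not say whether the exact or the chi-square variant was used, so the variant is an assumption, as are the hypothetical helper names `discordant_counts` and `mcnemar_exact` and the made-up counts in the example.

```python
from math import comb


def discordant_counts(before, after):
    """Count cases whose match with physician consensus changed.

    `before` and `after` are equal-length lists of booleans: did the LLM
    match the consensus on each case before and after improvement?
    Returns (b, c): b = matched before but not after, c = the reverse.
    """
    if len(before) != len(after):
        raise ValueError("before and after must cover the same cases")
    b = sum(x and not y for x, y in zip(before, after))
    c = sum(y and not x for x, y in zip(before, after))
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar P value from the two discordant counts.

    Under the null hypothesis, each discordant case is equally likely to
    fall in either direction, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant cases: no evidence of change
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: 1 case got worse and 9 got better after improvement.
p_value = mcnemar_exact(1, 9)
print(f"P = {p_value:.3f}; significant at .05: {p_value < 0.05}")
```

Only the discordant cases enter the test. Cases the model answered the same way before and after improvement carry no information about whether agreement changed.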

Although LLMs hold great potential for use in medicine, there needs to be an abundance of caution in using foundational LLMs in domains such as medical ethics. With adequate human oversight in testing and utilizing established techniques, LLM responses can be better aligned to human responses, even in the domain of medical ethics.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12592888/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12592888/full.md

## References

29 references — full list in the complete paper: https://tomesphere.com/paper/PMC12592888/full.md

---
Source: https://tomesphere.com/paper/PMC12592888