NurValues: Real-World Nursing Values Evaluation for Large Language Models in Clinical Context

Ben Yao; Qiuchi Li; Yazhou Zhang; Siyu Yang; Bohan Zhang; Prayag Tiwari; Jing Qin

arXiv:2505.08734·cs.CL·January 29, 2026

TL;DR

NurValues is a pioneering benchmark assessing how well large language models align with core nursing values in real-world clinical scenarios, highlighting ethical challenges and differences among models.

Contribution

This work introduces the first benchmark for nursing value alignment, including datasets from field studies and dialogue-based instances, to evaluate LLMs in clinical contexts.

Findings

01

General LLMs outperform medical LLMs in value alignment.

02

Justice is identified as the most challenging nursing value.

03

The benchmark reveals significant gaps in LLMs' ethical understanding in healthcare.

Abstract

While LLMs have demonstrated medical knowledge and conversational ability, their deployment in clinical practice raises new risks: patients may place greater trust in LLM-generated responses than in nurses' professional judgments, potentially intensifying nurse-patient conflicts. Such risks highlight the urgent need to evaluate whether LLMs align with the core nursing values upheld by human nurses. This work introduces the first benchmark for nursing value alignment, consisting of five core value dimensions distilled from international nursing codes: Altruism, Human Dignity, Integrity, Justice, and Professionalism. We define two-level tasks on the benchmark, considering the two characteristics of emerging nurse-patient conflicts. The Easy-Level dataset consists of 2,200 value-aligned and value-violating instances, which are collected through a five-month longitudinal field study…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 2·Confidence 3

Strengths

The authors describe various ablations or slices to evaluate the performance. It is commendable that the authors collected real-world data and annotated it extensively.

Weaknesses

1. Significance:
   - Why is such a benchmark required?
   - Since all of the data is situational rather than conversational, is the challenge different?
2. Novelty:
   - There has been similar exploration (https://pmc.ncbi.nlm.nih.gov/articles/PMC12099337/, https://arxiv.org/pdf/2505.04152, https://arxiv.org/abs/2409.15188) around LLMs for care and clinician-patient interaction, as well as other hospital agents (https://dl.acm.org/doi/10.1145/3699765, https://arxiv.org/pdf/2401.05654). Where does a benchmark like this ad

Reviewer 02·Rating 8·Confidence 3

Strengths

* Clear problem framing. The benchmark is built around established nursing codes, providing strong domain grounding and clear construct definitions.
* Good quality, realistic data with adversarial challenge cases. Easy-Level cases come from real nurse–patient dialogues. Hard-Level role-play and counterfactuals probe failure modes that are often omitted in other datasets. This two-tier design improves ecological validity and adversarial robustness assessment.
* Careful annotation and reliability

Weaknesses

* Taxonomy coverage and balance. The benchmark focuses on five nursing value dimensions, but several widely used nursing codes (e.g. privacy and confidentiality, advocacy, safety) seem to be missing [1-3].
* Adversarial data generation limitations. Most adversarial cases come from a single frontier model, which may introduce stylistic artefacts and attack-surface bias tied to that model (as shown by [4]).
* Evaluation metrics could be richer. Accuracy and macro-F1 on imbalanced, ordinal-like l
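The point about accuracy and macro-F1 on imbalanced labels can be illustrated with a minimal sketch. The label names `aligned` and `violating` and the toy predictions below are hypothetical, chosen only for illustration; they are not taken from the NurValues data or from the paper's evaluation code.

```python
# Minimal sketch: accuracy vs. macro-F1 on an imbalanced binary task.
# All labels and predictions here are hypothetical toy data.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Nine "aligned" instances and one "violating" instance (hypothetical).
y_true = ["aligned"] * 9 + ["violating"]
# A degenerate classifier that always answers "aligned".
y_pred = ["aligned"] * 10

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")  # 0.900
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")  # 0.474
```

In this toy case the always-`aligned` classifier reaches 0.900 accuracy while macro-F1 drops to about 0.474, because the minority class contributes an F1 of zero. Macro-F1 therefore exposes the failure that accuracy hides, although neither metric accounts for any ordering among labels.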

Reviewer 03·Rating 6·Confidence 3

Strengths

S1: Good realistic dataset relating to clinical and nursing settings.
- The authors carefully curated a real-world and diverse dataset describing nursing events that happened in different types of hospitals (rural, urban, etc.) with five-month field observational studies and five licensed nurse experts.
- This resource could be very useful as seed scenarios for many follow-up evaluations in this field.

S2: (Claimed) First work in the nursing field to explore an important topic (value alignment).
- the value

Weaknesses

[minor] w1 Missing procedure details for deriving nursing values from principles/rules
- Since the study builds on the five nursing values summarized in Section 2, it is important to justify how the authors identified and distilled these values (see lines 145–146).
- Recommend that the authors at least provide some examples of rules for each identified value. Ideally, they would release a dataset of rules/principles mapped to values to help readers and the community better understand.

[minor] w2 Lack

Taxonomy

Methods·Counterfactuals Explanations