NurValues: Real-World Nursing Values Evaluation for Large Language Models in Clinical Context
Ben Yao, Qiuchi Li, Yazhou Zhang, Siyu Yang, Bohan Zhang, Prayag Tiwari, Jing Qin

TL;DR
NurValues is a pioneering benchmark assessing how well large language models align with core nursing values in real-world clinical scenarios, highlighting ethical challenges and differences among models.
Contribution
This work introduces the first benchmark for nursing value alignment, including datasets from field studies and dialogue-based instances, to evaluate LLMs in clinical contexts.
Findings
General LLMs outperform medical LLMs in value alignment.
Justice is identified as the most challenging nursing value.
The benchmark reveals significant gaps in LLMs' ethical understanding in healthcare.
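A per-value comparison like the one behind these findings can be illustrated with a small scoring sketch. It assumes binary labels (1 for value-aligned, 0 for value-violating) and uses accuracy and macro-F1, the metrics named in the reviews. The record layout, the function names, and the toy data are hypothetical and are not taken from the NurValues release.

```python
from collections import defaultdict

# Illustrative record layout (an assumption): (value_dimension, gold_label, predicted_label),
# where 1 = value-aligned and 0 = value-violating.

def accuracy(gold, pred):
    """Fraction of instances where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred, labels=(0, 1)):
    """Unweighted mean of per-class F1, so the minority class counts equally."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def per_value_report(records):
    """Group records by nursing value and score each group separately."""
    grouped = defaultdict(lambda: ([], []))
    for value, gold, pred in records:
        grouped[value][0].append(gold)
        grouped[value][1].append(pred)
    return {
        value: {"accuracy": accuracy(g, p), "macro_f1": macro_f1(g, p)}
        for value, (g, p) in grouped.items()
    }

# Toy data, invented for illustration only (not results from the paper).
toy_records = [
    ("Justice", 1, 0), ("Justice", 0, 0), ("Justice", 1, 1), ("Justice", 0, 1),
    ("Integrity", 1, 1), ("Integrity", 0, 0), ("Integrity", 1, 1), ("Integrity", 0, 0),
]
report = per_value_report(toy_records)
hardest = min(report, key=lambda v: report[v]["macro_f1"])
```

Reporting macro-F1 alongside accuracy matters when aligned and violating instances are imbalanced, since accuracy alone can hide poor performance on the rarer class.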
Abstract
While LLMs have demonstrated medical knowledge and conversational ability, their deployment in clinical practice raises new risks: patients may place greater trust in LLM-generated responses than in nurses' professional judgments, potentially intensifying nurse-patient conflicts. Such risks highlight the urgent need to evaluate whether LLMs align with the core nursing values upheld by human nurses. This work introduces the first benchmark for nursing value alignment, consisting of five core value dimensions distilled from international nursing codes: Altruism, Human Dignity, Integrity, Justice, and Professionalism. We define two-level tasks on the benchmark, considering the two characteristics of emerging nurse-patient conflicts. The Easy-Level dataset consists of 2,200 value-aligned and value-violating instances, which are collected through a five-month longitudinal field study…
Peer Reviews
Decision·ICLR 2026 Poster
The authors describe various ablations and slices to evaluate performance. It is commendable that the authors collected real-world data and annotated it extensively.
1. Significance:
- Why is such a benchmark required?
- Since all of the data is situational rather than conversational, is the challenge different?
2. Novelty:
- There has been similar exploration (https://pmc.ncbi.nlm.nih.gov/articles/PMC12099337/, https://arxiv.org/pdf/2505.04152, https://arxiv.org/abs/2409.15188) of LLMs for care and clinician-patient interaction, as well as of other hospital agents (https://dl.acm.org/doi/10.1145/3699765, https://arxiv.org/pdf/2401.05654). Where does a benchmark like this ad
* Clear problem framing. The benchmark is built around established nursing codes, providing strong domain grounding and clear construct definitions.
* Good-quality, realistic data with adversarial challenge cases. Easy-Level cases come from real nurse-patient dialogues. Hard-Level role-play and counterfactuals probe failure modes that are often omitted in other datasets. This two-tier design improves ecological validity and adversarial robustness assessment.
* Careful annotation and reliability
* Taxonomy coverage and balance. The benchmark focuses on five nursing value dimensions, but several values that appear widely in nursing codes (e.g. privacy and confidentiality, advocacy, safety) seem to be missing [1-3].
* Adversarial data generation limitations. Most adversarial cases come from a single frontier model, which may introduce stylistic artefacts and attack-surface bias tied to that model (as shown by [4]).
* Evaluation metrics could be richer. Accuracy and macro-F1 on imbalanced, ordinal-like l
S1: Good, realistic dataset for clinical and nursing settings.
- The authors carefully curated a diverse, real-world dataset describing nursing events that occurred in different types of hospitals (rural, urban, etc.), drawing on a five-month observational field study and five licensed nurse experts.
- This resource could be very useful as a source of seed scenarios for many follow-up evaluations in this field.
S2: (Claimed) first work in the nursing field to explore an important topic (value alignment).
- the value
[minor] w1 Missing procedural details on deriving nursing values from principles/rules
- Since the study builds on the five nursing values summarized in Section 2, it is important to justify how the authors identified and distilled these values (see lines 145–146).
- I recommend that the authors provide at least some example rules for each identified value. Ideally, they would release a dataset of rules/principles mapped to values to help readers and the community better understand the derivation.
[minor] w2 Lack
Taxonomy
Methods: Counterfactuals Explanations
