Enhancing the Medical Context-Awareness Ability of LLMs via Multifaceted Self-Refinement Learning

Yuxuan Zhou; Yubin Wang; Bin Wang; Chen Ning; Xien Liu; Ji Wu; Jianye Hao

arXiv:2511.10067·cs.AI·November 17, 2025

TL;DR

This paper introduces MuSeR, a self-refinement learning approach that enhances large language models' medical context-awareness by simulating diverse user scenarios, self-evaluating responses, and fine-tuning on the refined outputs, reaching state-of-the-art performance among open-source LLMs on HealthBench.

Contribution

The paper presents a novel self-refinement training method that improves LLMs' medical context-awareness across decision-making, communication, and safety facets.

Findings

01

Significant performance improvements on the HealthBench dataset.

02

Smaller models surpass their larger teachers with knowledge distillation.

03

Achieved new state-of-the-art results among open-source LLMs on HealthBench.

Abstract

Large language models (LLMs) have shown great promise in the medical domain, achieving strong performance on several benchmarks. However, they continue to underperform in real-world medical scenarios, which often demand stronger context-awareness, i.e., the ability to recognize missing or critical details (e.g., user identity, medical history, risk factors) and provide safe, helpful, and contextually appropriate responses. To address this issue, we propose Multifaceted Self-Refinement (MuSeR), a data-driven approach that enhances LLMs' context-awareness along three key facets (decision-making, communication, and safety) through self-evaluation and refinement. Specifically, we first design an attribute-conditioned query generator that simulates diverse real-world user contexts by varying attributes such as role, geographic region, intent, and degree of information ambiguity. An LLM then…
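The attribute-conditioned query generation described in the abstract can be illustrated with a minimal sketch. The abstract names only the attribute types (role, geographic region, intent, degree of information ambiguity); the concrete values, the prompt wording, and the function names `enumerate_contexts`, `build_generation_prompt`, and `sample_prompts` below are all assumptions made for illustration and are not taken from the paper.

```python
import itertools
import random

# Hypothetical attribute values. The abstract lists the attribute types,
# but these concrete values are assumed for illustration only.
ATTRIBUTES = {
    "role": ["patient", "caregiver", "clinician"],
    "region": ["North America", "South Asia", "Sub-Saharan Africa"],
    "intent": ["symptom triage", "medication question", "test interpretation"],
    "ambiguity": ["low", "high"],
}


def enumerate_contexts(attributes):
    """Yield every combination of attribute values as a dict."""
    keys = list(attributes)
    for values in itertools.product(*(attributes[k] for k in keys)):
        yield dict(zip(keys, values))


def build_generation_prompt(context):
    """Render one user context as an instruction for a query-writing LLM."""
    # High ambiguity asks the query writer to leave out critical details,
    # so the answering model has to notice what is missing.
    detail = (
        "omit key details such as age, history, or risk factors"
        if context["ambiguity"] == "high"
        else "state the relevant details explicitly"
    )
    return (
        f"Write one realistic medical question from a {context['role']} "
        f"in {context['region']} whose intent is {context['intent']}. "
        f"The question should {detail}."
    )


def sample_prompts(attributes, n, seed=0):
    """Sample n distinct contexts (seeded) and render each as a prompt."""
    rng = random.Random(seed)
    contexts = list(enumerate_contexts(attributes))
    return [build_generation_prompt(c) for c in rng.sample(contexts, n)]
```

Each returned prompt would then be sent to a generator LLM to produce one simulated user query; that call is left out here because the abstract does not specify the model interface.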

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

The results are strong and well-documented. MuSeR delivers consistent, sizable gains on HealthBench across multiple backbones, with Qwen3-32B and Qwen3-14B improved to 63.8% and 61.8%, surpassing a stronger teacher and setting a new open-source SOTA; detailed plots also show improvements on the hard subset and across evaluation axes/themes, emphasizing context-awareness. The paper backs these outcomes with granular analyses: stage-wise and multi-faceted self-refinement and facet ablations. Overa

Weaknesses

Backbone coverage is narrow; most results center on Qwen, so it’s unclear whether gains transfer to other families. Adding full runs on Llama-3/Mistral/Qwen2.5 (multiple sizes) would clarify generality. On novelty and positioning, the attribute-conditioned query synthesis overlaps with prior medically conditioned instruction tuning (e.g., AlpaCare’s diverse, synthetic medical instructions [1]); this prior work should be cited and contrasted to specify what is new here beyond the cho

Reviewer 02 · Rating 4 · Confidence 5

Strengths

1. This paper focuses on a data-driven pipeline that uses multifaceted evaluation metrics to refine the generation. This process ensures the diversity and quality of the data used for further reinforcement training.
2. The results show that the pipeline significantly improves the health capabilities of smaller LLMs, making the data potentially useful for this domain.
3. Health is a domain that requires more careful evaluation, and this paper focuses on decision-making, communication, and saf

Weaknesses

1. Why use only GPT-oss-120B as the teacher? What about other models? GPT-oss-120B might not be the best choice.
2. Why choose these three key facets (decision-making, communication, and safety) and not other dimensions? I would like to see more scientific justification for this choice.

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- The authors point out a well-motivated and timely issue in the medical LLM domain by emphasizing the distinction between exam-style benchmarks and real-world medical scenarios; their focus on context-awareness makes the task notably harder yet more realistic and relevant for clinical applications.
- According to the authors’ results, MuSeR outperforms several top-priority models and comes close to gpt-5-thinking while using much smaller backbones (e.g., Qwen3-14B). Interest

Weaknesses

- The main limitation is the narrow evaluation setup. Although the authors conduct experiments with different model families and baselines for comparison, they evaluate solely on HealthBench, making it difficult to assess the framework's robustness. It would be beneficial to see results on at least 2-3 additional benchmarks. Moreover, HealthBench relies on GPT-4.1 as an LLM-as-a-Judge; such evaluations require statistical consistency tests, where the same experiments should be run multiple times

Code & Models

Datasets

zyx1234/MuSeR_GPT_OSS_120B_Distillation
dataset· 18 dl

Taxonomy

Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Topic Modeling