Enhancing the Medical Context-Awareness Ability of LLMs via Multifaceted Self-Refinement Learning
Yuxuan Zhou, Yubin Wang, Bin Wang, Chen Ning, Xien Liu, Ji Wu, Jianye Hao

TL;DR
This paper introduces MuSeR, a self-refinement learning approach that significantly enhances large language models' medical context-awareness by simulating diverse scenarios, self-evaluating, and fine-tuning, leading to state-of-the-art performance.
Contribution
The paper presents a novel self-refinement training method that improves LLMs' medical context-awareness across decision-making, communication, and safety facets.
Findings
Significant performance improvements on the HealthBench dataset.
Smaller models surpass their larger teachers with knowledge distillation.
Achieved new state-of-the-art results among open-source LLMs on HealthBench.
Abstract
Large language models (LLMs) have shown great promise in the medical domain, achieving strong performance on several benchmarks. However, they continue to underperform in real-world medical scenarios, which often demand stronger context-awareness, i.e., the ability to recognize missing or critical details (e.g., user identity, medical history, risk factors) and provide safe, helpful, and contextually appropriate responses. To address this issue, we propose Multifaceted Self-Refinement (MuSeR), a data-driven approach that enhances LLMs' context-awareness along three key facets (decision-making, communication, and safety) through self-evaluation and refinement. Specifically, we first design an attribute-conditioned query generator that simulates diverse real-world user contexts by varying attributes such as role, geographic region, intent, and degree of information ambiguity. An LLM then…
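The attribute-conditioned query generation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the attribute types (role, region, intent, ambiguity) come from the abstract, but the value lists, the `UserContext` class, the function names, and the prompt wording are all hypothetical. The sketch only samples one value per attribute and formats a prompt that could be handed to a query-writing LLM; no model is called.

```python
import random
from dataclasses import dataclass, fields

# Hypothetical value lists. The abstract names the attribute types only;
# these concrete values are assumptions made for illustration.
ATTRIBUTE_VALUES = {
    "role": ["patient", "caregiver", "nurse", "physician"],
    "region": ["North America", "Western Europe", "South Asia", "East Africa"],
    "intent": ["symptom triage", "medication question", "test result interpretation"],
    "ambiguity": ["low", "medium", "high"],
}


@dataclass(frozen=True)
class UserContext:
    """One simulated user context (hypothetical structure)."""
    role: str
    region: str
    intent: str
    ambiguity: str


def sample_context(rng: random.Random) -> UserContext:
    """Draw one value per attribute, independently and uniformly."""
    return UserContext(
        **{name: rng.choice(values) for name, values in ATTRIBUTE_VALUES.items()}
    )


def build_generator_prompt(ctx: UserContext) -> str:
    """Format the sampled attributes into a prompt for a query-writing LLM."""
    lines = [f"- {f.name}: {getattr(ctx, f.name)}" for f in fields(ctx)]
    header = "Write one realistic medical query from a user with these attributes:"
    return header + "\n" + "\n".join(lines)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sampled contexts are reproducible
    for _ in range(3):
        print(build_generator_prompt(sample_context(rng)))
        print()
```

Sampling each attribute independently is the simplest choice; a real pipeline might weight or constrain combinations so that implausible contexts are avoided.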
Peer Reviews
Decision · Submitted to ICLR 2026
The results are strong and well-documented. MuSeR delivers consistent, sizable gains on HealthBench across multiple backbones, with Qwen3-32B and Qwen3-14B improved to 63.8% and 61.8%, surpassing a stronger teacher and setting a new open-source SOTA; detailed plots also show improvements on the hard subset and across evaluation axes/themes, emphasizing context-awareness. The paper backs these outcomes with granular analyses: stage-wise and multi-faceted self-refinement and facet ablations. Overa
Weaknesses: Backbone coverage is narrow; most results center on Qwen, so it’s unclear whether gains transfer to other families. Adding full runs on Llama-3/Mistral/Qwen2.5 (multiple sizes) would clarify generality. On novelty and positioning, the attribute-conditioned query synthesis overlaps with prior medically conditioned instruction tuning (e.g., AlpaCare’s diverse, synthetic medical instructions [1]); this prior work should be cited and contrasted to specify what is new here beyond the cho
1. This paper focuses on a data-driven pipeline that uses multifaceted evaluation metrics to refine generated responses. This process ensures the diversity and quality of the data used for further reinforcement training. 2. Their results show that the pipeline significantly improves the health capabilities of smaller LLMs, making the data potentially useful for this domain. 3. Health is a domain that requires more careful evaluation, and this paper focuses on decision-making, communication, and saf
1. Why is only GPT-oss-120B used as the teacher? What about other models? GPT-oss-120B might not be the best choice. 2. Why were these three key facets (decision-making, communication, and safety) chosen rather than other dimensions? I would like to see more scientific justification for this choice.
- The authors point out a well-motivated and timely issue in the medical LLM domain by emphasizing the distinction between exam-style benchmarks and real-world medical scenarios. Their focus on context-awareness makes the task notably harder yet more realistic and relevant for clinical applications. - According to the authors’ results, MuSeR outperforms several top-priority models and approaches the performance of gpt-5-thinking while using much smaller backbones (e.g., Qwen3-14B). Interest
- The main limitation is the narrow evaluation setup. Although the authors conduct experiments with different model families and baselines for comparison, they evaluate solely on HealthBench, making it difficult to assess the framework's robustness. It would be beneficial to see results on at least 2-3 additional benchmarks. Moreover, HealthBench relies on GPT-4.1 as an LLM-as-a-Judge; such evaluations require statistical consistency tests, where the same experiments should be run multiple times
Taxonomy
Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Topic Modeling
