Health-SCORE: Towards Scalable Rubrics for Improving Health-LLMs

Zhichao Yang; Sepehr Janghorbani; Dongxu Zhang; Jun Han; Qian Qian; Andrew Ressler II; Gregory D. Lyng; Sanjit Singh Batra; Robert E. Tillman

arXiv:2601.18706·cs.AI·January 27, 2026

Open Access

TL;DR

Health-SCORE introduces a scalable rubric-based framework for evaluating and training healthcare-related language models, reducing development costs while maintaining high evaluation quality and enabling safety-aware reinforcement learning.

Contribution

It presents a generalizable, cost-effective rubric framework that enhances model evaluation and training in healthcare without sacrificing performance.

Findings

01. Health-SCORE matches human rubric evaluation quality.

02. It reduces rubric development effort significantly.

03. It enables safety-aware reinforcement learning and improved in-context learning.

Abstract

Rubrics are essential for evaluating open-ended LLM responses, especially in safety-critical domains such as healthcare. However, creating high-quality, domain-specific rubrics typically requires significant human expertise, time, and development cost, making rubric-based evaluation and training difficult to scale. In this work, we introduce Health-SCORE, a generalizable and scalable rubric-based training and evaluation framework that substantially reduces rubric development costs without sacrificing performance. We show that Health-SCORE provides two practical benefits beyond standalone evaluation: it can be used as a structured reward signal to guide reinforcement learning with safety-aware supervision, and it can be incorporated directly into prompts to improve response quality through in-context learning. Across open-ended healthcare tasks, Health-SCORE achieves evaluation quality…
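
The two uses named in the abstract, a rubric as a reward signal and a rubric placed in the prompt, can be illustrated with a minimal sketch. This is not the paper's implementation: the `Criterion` type, the point weights, the normalization, the keyword-based `toy_judge`, and the prompt wording are all assumptions made for illustration. In practice the judge would more plausibly be an LLM grader.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Criterion:
    text: str      # what the response should (or should not) do
    points: float  # positive = desirable; negative = penalized (e.g. unsafe advice)


# Hypothetical judge signature: (response, criterion text) -> criterion met?
Judge = Callable[[str, str], bool]


def rubric_reward(response: str, rubric: List[Criterion], judge: Judge) -> float:
    """Points earned divided by the maximum positive points, clipped to [0, 1]."""
    max_points = sum(c.points for c in rubric if c.points > 0)
    if max_points == 0:
        return 0.0
    earned = sum(c.points for c in rubric if judge(response, c.text))
    return min(1.0, max(0.0, earned / max_points))


def rubric_prompt(question: str, rubric: List[Criterion]) -> str:
    """Place the rubric in the prompt so the model can condition on it."""
    do = [f"- {c.text}" for c in rubric if c.points > 0]
    avoid = [f"- {c.text}" for c in rubric if c.points < 0]
    return "\n".join(
        ["A good answer should:", *do, "A good answer should NOT:", *avoid,
         "", f"Question: {question}"]
    )


# Illustrative rubric (assumed, not taken from the paper).
RUBRIC = [
    Criterion("Recommends urgent medical evaluation", 5.0),
    Criterion("Asks about symptom duration", 3.0),
    Criterion("Gives a specific drug dose without context", -4.0),
]

# Toy stand-in for a judge: one keyword per criterion.
KEYWORDS = {
    "Recommends urgent medical evaluation": "urgent",
    "Asks about symptom duration": "how long",
    "Gives a specific drug dose without context": "mg",
}


def toy_judge(response: str, criterion_text: str) -> bool:
    return KEYWORDS[criterion_text] in response.lower()


safe = rubric_reward("Please seek urgent care now.", RUBRIC, toy_judge)
unsafe = rubric_reward(
    "Seek urgent care and take 500 mg of ibuprofen.", RUBRIC, toy_judge
)
prompt = rubric_prompt("I have chest pain. What should I do?", RUBRIC)
```

In this sketch the negative-point criterion is what makes the reward safety-aware: the second response satisfies the same positive criterion as the first but scores lower because it also triggers the penalized one.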

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Reinforcement Learning in Robotics · Robot Manipulation and Learning