Automated Rubrics for Reliable Evaluation of Medical Dialogue Systems

Yinzhu Chen; Abdine Maiga; Hossein A. Rahmani; Emine Yilmaz

arXiv:2601.15161 · cs.CL · May 14, 2026

TL;DR

This paper introduces a retrieval-augmented multi-agent framework that automates the creation of detailed, evidence-based evaluation rubrics for medical dialogue systems, improving assessment accuracy and guiding response refinement.

Contribution

It presents a novel method for generating instance-specific, verifiable rubrics grounded in medical evidence, enhancing evaluation reliability and scalability.

Findings

01 Achieves higher Clinical Intent Alignment (CIA) scores than the GPT-4o baseline.

02 Demonstrates robust cross-lingual generalization.

03 Improves response quality by 9.2% when the generated rubrics are used to guide refinement.

Abstract

Large Language Models (LLMs) are increasingly used for clinical decision support, where hallucinations and unsafe suggestions may pose direct risks to patient safety. These risks are hard to assess: subtle clinical errors are often missed by generic metrics and LLM judges using general criteria, while expert-authored fine-grained rubrics are expensive and difficult to scale. In this paper, we propose a retrieval-augmented multi-agent framework designed to automate the generation of instance-specific evaluation rubrics. Our approach grounds evaluation in authoritative medical evidence by decomposing retrieved content into atomic facts and synthesizing them with user interaction constraints to form verifiable, fine-grained evaluation criteria. Evaluated on HealthBench and LLMEval-Med datasets, our framework achieves Clinical Intent Alignment (CIA) scores of 50.20% and 31.90%,…
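
The rubric-construction step described in the abstract (decompose retrieved evidence into atomic facts, then combine them with user interaction constraints) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `Criterion` class, the `split_atomic_facts` and `build_rubric` functions, the sentence-level splitting rule, and the placeholder inputs are hypothetical and are not taken from the paper or its repository, where this step is presumably performed by LLM agents over retrieved medical sources.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One verifiable rubric item (hypothetical structure, not the paper's)."""
    text: str
    source: str  # "evidence" or "constraint"


def split_atomic_facts(passage: str) -> list[str]:
    """Toy stand-in for fact decomposition: treat each sentence as one fact."""
    parts = re.split(r"(?<=[.!?])\s+", passage.strip())
    return [p.strip() for p in parts if p.strip()]


def build_rubric(passages: list[str], constraints: list[str]) -> list[Criterion]:
    """Merge evidence-derived facts and user constraints, dropping duplicates."""
    rubric: list[Criterion] = []
    seen: set[str] = set()
    for passage in passages:
        for fact in split_atomic_facts(passage):
            key = fact.lower()
            if key not in seen:
                seen.add(key)
                rubric.append(Criterion(f"Response is consistent with: {fact}", "evidence"))
    for constraint in constraints:
        key = constraint.lower()
        if key not in seen:
            seen.add(key)
            rubric.append(Criterion(f"Response respects: {constraint}", "constraint"))
    return rubric


# Placeholder inputs only; real passages would come from a retrieval step.
rubric = build_rubric(
    ["Placeholder fact one. Placeholder fact two.", "Placeholder fact one."],
    ["Answer in plain language"],
)
for item in rubric:
    print(f"[{item.source}] {item.text}")
```

The sketch keeps each criterion tied to its origin (retrieved evidence or user constraint), so a judge could in principle check criteria one at a time. How the paper actually retrieves, decomposes, and scores is not specified in the text above.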

Code & Models

Repositories

AmbeChen/Automated-Rubric-Generation (GitHub)