TL;DR
This paper introduces a retrieval-augmented multi-agent framework that automates the creation of detailed, evidence-based evaluation rubrics for medical dialogue systems, improving assessment accuracy and guiding response refinement.
Contribution
It presents a novel method for generating instance-specific, verifiable rubrics grounded in medical evidence, enhancing evaluation reliability and scalability.
Findings
Achieves higher Clinical Intent Alignment (CIA) scores than the GPT-4o baseline on the HealthBench and LLMEval-Med datasets.
Demonstrates robust cross-lingual generalization.
Improves response quality by 9.2% when the generated rubrics are used to guide response refinement.
Abstract
Large Language Models (LLMs) are increasingly used for clinical decision support, where hallucinations and unsafe suggestions may pose direct risks to patient safety. These risks are hard to assess: subtle clinical errors are often missed by generic metrics and LLM judges using general criteria, while expert-authored fine-grained rubrics are expensive and difficult to scale. In this paper, we propose a retrieval-augmented multi-agent framework designed to automate the generation of instance-specific evaluation rubrics. Our approach grounds evaluation in authoritative medical evidence by decomposing retrieved content into atomic facts and synthesizing them with user interaction constraints to form verifiable, fine-grained evaluation criteria. Evaluated on HealthBench and LLMEval-Med datasets, our framework achieves Clinical Intent Alignment (CIA) scores of 50.20% and 31.90%,…
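The rubric-construction step described in the abstract (decompose retrieved content into atomic facts, then combine them with user interaction constraints) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Criterion`, `split_into_atomic_facts`, and `build_rubric` are hypothetical, and naive sentence splitting stands in for the agent-based decomposition the paper describes.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric item (hypothetical structure, not from the paper)."""
    text: str    # statement a response is checked against
    source: str  # "evidence" or "constraint"


def split_into_atomic_facts(passage: str) -> list[str]:
    """Assumption: one sentence is treated as one atomic fact.

    The paper decomposes content with agents; this regex split is only
    a placeholder so the sketch runs without any model.
    """
    sentences = re.split(r"(?<=[.!?])\s+", passage.strip())
    return [s.strip() for s in sentences if s.strip()]


def build_rubric(passages: list[str], constraints: list[str]) -> list[Criterion]:
    """Merge evidence-derived facts with user constraints into one rubric."""
    rubric: list[Criterion] = []
    seen: set[str] = set()
    for passage in passages:
        for fact in split_into_atomic_facts(passage):
            key = fact.lower()
            if key in seen:  # skip case-insensitive duplicate facts
                continue
            seen.add(key)
            rubric.append(Criterion(f"Response is consistent with: {fact}", "evidence"))
    for constraint in constraints:
        rubric.append(
            Criterion(f"Response respects the user constraint: {constraint}", "constraint")
        )
    return rubric


if __name__ == "__main__":
    # Placeholder inputs; real passages would come from a retrieval step.
    demo = build_rubric(["Statement A. Statement B."], ["keep the answer brief"])
    for item in demo:
        print(item.source, "|", item.text)
```

Keeping each criterion tied to a single fact or constraint is what makes it individually checkable; how the paper's agents score a response against each criterion is not covered by this sketch.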
