TL;DR
This paper introduces HDCEval, a hierarchical evaluation framework for medical LLMs that decomposes complex tasks into subtasks evaluated by expert models, improving alignment with human judgment in clinical settings.
Contribution
The paper presents a hierarchical evaluation framework built on fine-grained medical guidelines (Patient Question Relevance, Medical Knowledge Correctness, and Expression) and on trained expert models, with the aim of making LLM assessments in healthcare more accurate.
Findings
HDCEval improves alignment with human evaluators in medical assessments.
Hierarchical decomposition enhances evaluation precision across multiple medical criteria.
Expert model training via Attribute-Driven Token Optimization boosts evaluation reliability.
Abstract
In the rapidly evolving landscape of large language models (LLMs) for medical applications, ensuring the reliability and accuracy of these models in clinical settings is paramount. Existing benchmarks often focus on fixed-format tasks like multiple-choice QA, which fail to capture the complexity of real-world clinical diagnostics. Moreover, traditional evaluation metrics and LLM-based evaluators struggle with misalignment, often providing oversimplified assessments that do not adequately reflect human judgment. To address these challenges, we introduce HDCEval, a Hierarchical Divide-and-Conquer Evaluation framework tailored for fine-grained alignment in medical evaluation. HDCEval is built on a set of fine-grained medical evaluation guidelines developed in collaboration with professional doctors, encompassing Patient Question Relevance, Medical Knowledge Correctness, and Expression. The…
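To make the divide-and-conquer idea concrete, here is a minimal sketch of how one answer could be scored separately under the three guideline dimensions named in the abstract and then combined. It is an illustration under stated assumptions, not the paper's implementation: the `Criterion` type, the three `*_stub` scorers, the [0, 1] score range, and the unweighted mean used for the overall score are all hypothetical. In HDCEval the per-dimension judgments come from trained expert models, which the toy heuristics below only stand in for.

```python
import re
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# A scorer maps (question, answer) to a score in [0, 1].
# Assumption: this range and signature are illustrative only.
Scorer = Callable[[str, str], float]


@dataclass
class Criterion:
    name: str
    scorer: Scorer  # hypothetical stand-in for a trained expert model


def _words(text: str) -> set:
    # Lowercased word tokens with punctuation stripped.
    return set(re.findall(r"\w+", text.lower()))


def relevance_stub(question: str, answer: str) -> float:
    # Toy heuristic: fraction of question words that reappear in the answer.
    q, a = _words(question), _words(answer)
    return len(q & a) / len(q) if q else 0.0


def correctness_stub(question: str, answer: str) -> float:
    # Placeholder: a real expert model would check the medical content.
    return 0.5


def expression_stub(question: str, answer: str) -> float:
    # Toy heuristic: reward answers that end as a complete sentence.
    return 1.0 if answer.strip().endswith(".") else 0.5


# The three top-level dimensions come from the abstract; the scorers do not.
CRITERIA: List[Criterion] = [
    Criterion("Patient Question Relevance", relevance_stub),
    Criterion("Medical Knowledge Correctness", correctness_stub),
    Criterion("Expression", expression_stub),
]


def evaluate(question: str, answer: str,
             criteria: List[Criterion] = CRITERIA) -> Dict[str, float]:
    # Divide: score each dimension independently.
    scores = {c.name: c.scorer(question, answer) for c in criteria}
    # Conquer: combine the sub-scores (assumption: unweighted mean).
    scores["overall"] = mean(scores[c.name] for c in criteria)
    return scores


if __name__ == "__main__":
    report = evaluate("what causes fever", "Infection often causes fever.")
    for name, score in report.items():
        print(f"{name}: {score:.2f}")
```

Keeping each dimension behind its own scorer means a stub can be replaced by a stronger evaluator without touching the aggregation step, which is the practical appeal of decomposing the evaluation.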
Taxonomy
Methods: Sparse Evolutionary Training · Focus
