AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation

Xiechi Zhang; Zetian Ouyang; Linlin Wang; Gerard de Melo; Zhu Cao; Xiaoling Wang; Ya Zhang; Yanfeng Wang; Liang He

arXiv:2505.11887·cs.CL·May 20, 2025

TL;DR

AutoMedEval is an open-source, 13B-parameter model designed to automatically evaluate medical language models' question-answering abilities, aiming to reduce reliance on costly human assessments.

Contribution

The paper introduces a hierarchical training approach with curriculum tuning and knowledge introspection, enabling effective medical evaluation with limited data.

Findings

01. AutoMedEval outperforms baselines in correlating with human judgments.
02. It effectively assesses diverse medical LLMs' responses.
03. The model reduces the need for human evaluation in medical AI assessment.
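
As a rough illustration of what "correlating with human judgments" can mean in practice, the sketch below computes a rank correlation between human ratings and the scores of two automatic evaluators. The choice of Spearman correlation is an assumption (the summary above does not name the statistic used), and all scores are hypothetical values invented for the example, not results from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five model answers (illustrative only).
human = [5, 3, 4, 1, 2]                     # expert ratings
evaluator = [4.5, 3.0, 4.0, 1.5, 2.5]       # an evaluator that preserves the human ordering
overlap_metric = [0.2, 0.9, 0.4, 0.8, 0.5]  # a metric that mostly inverts it

rho_evaluator = spearman(human, evaluator)
rho_overlap = spearman(human, overlap_metric)
print(round(rho_evaluator, 3), round(rho_overlap, 3))
```

An evaluator that ranks answers the way the experts do reaches a correlation near 1, while a metric that disagrees with them lands near or below 0. In real use, a library routine such as `scipy.stats.spearmanr` would normally replace the hand-written helpers.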

Abstract

With the proliferation of large language models (LLMs) in the medical domain, there is increasing demand for improved evaluation techniques to assess their capabilities. However, traditional metrics like F1 and ROUGE, which rely on token overlap to measure quality, largely overlook the importance of medical terminology. While human evaluation tends to be more reliable, it can be very costly and may also suffer from inaccuracies due to limits in human expertise and motivation. Although some LLM-based evaluation methods exist, their usability in the medical field is limited by their proprietary nature or lack of domain expertise. To tackle these challenges, we present AutoMedEval, an open-source automatic evaluation model with 13B parameters specifically engineered to measure the question-answering proficiency of medical LLMs. The overarching objective of AutoMedEval is to…
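
The abstract's point about token-overlap metrics can be made concrete with a small sketch. The function below is a common SQuAD-style token-level F1, not code from the paper, and the three sentences are hypothetical examples chosen to show how a single wrong medical term barely moves the score while a correct paraphrase is penalized heavily.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 based on multiset overlap of whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference and candidate answers (illustrative only).
reference = "the patient should receive insulin for hyperglycemia"
wrong_term = "the patient should receive insulin for hypoglycemia"  # clinically wrong
paraphrase = "give insulin to treat high blood sugar"               # clinically equivalent

f1_wrong = token_f1(wrong_term, reference)
f1_paraphrase = token_f1(paraphrase, reference)
print(round(f1_wrong, 3), round(f1_paraphrase, 3))
```

Here the answer with the wrong condition shares six of seven tokens with the reference and scores about 0.857, whereas the faithful paraphrase shares only one token and scores about 0.143. This is the kind of failure that motivates a learned, domain-aware evaluator.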

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Machine Learning in Healthcare