MedScore: Generalizable Factuality Evaluation of Free-Form Medical Answers by Domain-adapted Claim Decomposition and Verification

Heyuan Huang; Alexandra DeLucia; Vijay Murari Tiyyala; Mark Dredze

arXiv:2505.18452·cs.CL·October 21, 2025

TL;DR

MedScore is a domain-adapted, modular pipeline for evaluating the factuality of free-form medical answers. It decomposes answers into claims with fewer hallucinations and vague references, then verifies those claims against in-domain data.

Contribution

The paper introduces MedScore, a domain-specific factuality evaluation pipeline that improves claim decomposition and verification for medical answers and extracts up to three times more valid facts than previous methods.

Findings

1. Extracts up to three times more valid facts than previous methods.

2. Reduces hallucination and vague references in medical answer evaluation.

3. Factuality scores vary significantly with different decomposition and verification methods.

Abstract

While Large Language Models (LLMs) can generate fluent and convincing responses, they are not necessarily correct. This is especially apparent in the popular decompose-then-verify factuality evaluation pipeline, where LLMs evaluate generations by decomposing the generations into individual, valid claims. Factuality evaluation is especially important for medical answers, since incorrect medical information could seriously harm the patient. However, existing factuality systems are a poor match for the medical domain, as they are typically only evaluated on objective, entity-centric, formulaic texts such as biographies and historical topics. This differs from condition-dependent, conversational, hypothetical, sentence-structure diverse, and subjective medical answers, which makes decomposition into valid facts challenging. We propose MedScore, a new pipeline to decompose medical answers…
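The decompose-then-verify idea described in the abstract can be illustrated with a minimal sketch. This is not MedScore's implementation: the function names (`factuality_score`, `split_sentences`, `in_reference`), the toy reference set, the sentence-splitting decomposer, and the exact-match verifier are all hypothetical stand-ins for the LLM-based decomposition and in-domain verification that the paper describes. The score here is assumed to be the fraction of decomposed claims that the verifier supports, with an empty answer scored as 0.0 by convention.

```python
from typing import Callable, List


def factuality_score(
    answer: str,
    decompose: Callable[[str], List[str]],
    verify: Callable[[str], bool],
) -> float:
    """Return the fraction of decomposed claims that the verifier supports.

    Assumption: an answer that yields no claims is scored 0.0.
    """
    claims = decompose(answer)
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if verify(claim))
    return supported / len(claims)


def split_sentences(answer: str) -> List[str]:
    # Hypothetical stand-in decomposer: one claim per sentence.
    # A real system would use an LLM to produce self-contained claims.
    return [s.strip() for s in answer.split(".") if s.strip()]


# Hypothetical toy reference corpus (lowercased exact statements).
REFERENCE = {
    "ibuprofen is an nsaid",
    "nsaids can irritate the stomach",
}


def in_reference(claim: str) -> bool:
    # Hypothetical stand-in verifier: exact match against the toy corpus.
    return claim.lower() in REFERENCE


answer = (
    "Ibuprofen is an NSAID. "
    "NSAIDs can irritate the stomach. "
    "Ibuprofen cures infections."
)
score = factuality_score(answer, split_sentences, in_reference)
print(f"{score:.2f}")
```

Because `decompose` and `verify` are passed in as parameters, either stage can be swapped independently, which is one way to observe how the resulting score shifts with the choice of decomposition or verification method.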

Code & Models

Repositories

heyuan9/medscore
Official

Datasets

Heyuan9/AskDocsAI
Dataset · 18 downloads

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Misinformation and Its Impacts