Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications

Abu Noman Md Sakib; Md. Main Oddin Chisty; Zijie Zhang

arXiv:2604.19281·cs.HC·April 22, 2026


TL;DR

This paper introduces VB-Score, a new component-wise evaluation framework for medical question answering models that assesses entity recognition, semantic similarity, factual consistency, and information completeness, revealing significant performance disparities and health equity risks.

Contribution

The paper presents VB-Score, an evaluation method that scores medical LLM answers component by component, exposing shortcomings and disparities that semantic similarity alone does not capture and underscoring the need for more comprehensive assessment.

Findings

01. Models show major discrepancies between semantic and entity accuracy.

02. All models exhibit severe performance failures across evaluated components.

03. Public health disparities are evident, with lower accuracy for conditions affecting older and minority populations.

Abstract

Large Language Models (LLMs) are increasingly used to help patients with medical questions. However, most metrics currently used to evaluate these models in this setting measure only how closely a model's answers match semantically, and therefore give no true indication of the model's medical accuracy or of the associated health equity risks. To address these shortcomings, we present VB-Score (Verification-Based Score), a new evaluation framework for medical question answering that separately evaluates four components: entity recognition, semantic similarity, factual consistency, and structured information completeness. We rigorously evaluate three well-known and widely used LLMs on 48 public health-related topics taken…
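
To make the gap between semantic match and medical accuracy concrete, here is a minimal Python sketch of component-wise scoring. It is not the authors' implementation: the abstract names the four components but does not say how each is computed, so every scorer below is a hypothetical stand-in (token-set Jaccard overlap for semantic similarity, substring matching for entities and facts, cue-word matching for completeness). The names `ComponentScores`, `score_answer`, and the example inputs are assumptions made for illustration only.

```python
import re
from dataclasses import dataclass


@dataclass
class ComponentScores:
    """One score per component, kept separate instead of averaged."""
    entity: float
    semantic: float
    factual: float
    completeness: float


def _tokens(text: str) -> set[str]:
    # Lowercased alphanumeric tokens; a crude stand-in for real tokenization.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _fraction(hits: int, total: int) -> float:
    return hits / total if total else 0.0


def score_answer(answer, reference, entities, facts, sections) -> ComponentScores:
    """Toy proxies only; none of these is the scoring used in the paper."""
    low = answer.lower()
    a, r = _tokens(answer), _tokens(reference)
    return ComponentScores(
        # Share of reference entities mentioned in the answer.
        entity=_fraction(sum(e.lower() in low for e in entities), len(entities)),
        # Jaccard overlap of token sets as a cheap similarity proxy.
        semantic=_fraction(len(a & r), len(a | r)),
        # Share of reference fact strings restated verbatim.
        factual=_fraction(sum(f.lower() in low for f in facts), len(facts)),
        # Share of expected sections for which any cue word appears.
        completeness=_fraction(
            sum(any(c in low for c in cues) for cues in sections.values()),
            len(sections),
        ),
    )


# Hypothetical example: the answer swaps one drug name and nothing else.
reference = "Metformin is the first-line drug for type 2 diabetes; take it with meals."
answer = "Insulin is the first-line drug for type 2 diabetes; take it with meals."

scores = score_answer(
    answer,
    reference,
    entities=["metformin", "type 2 diabetes"],
    facts=["metformin is the first-line drug"],
    sections={"treatment": ["drug", "medication"], "seek_care": ["doctor", "emergency"]},
)
print(scores)
```

In this toy case the overlap-based similarity stays high because only one token differs, while the entity and fact scores drop because the one changed token is the drug name. Reporting the four numbers separately keeps that failure visible where a single similarity score would hide it.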
