Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As

Eden Avnat; Michal Levy; Daniel Herstain; Elia Yanko; Daniel Ben Joya; Michal Tzuchman Katz; Dafna Eshel; Sahar Laros; Yael Dagan; Shahar Barami; Joseph Mermelstein; Shahar Ovadia; Noam Shomron; Varda Shalev; Raja-Elie E. Abdulnour

arXiv:2406.03855·cs.CL·July 25, 2024

Open Access

TL;DR

This study benchmarks large language models on evidence-based medical questions. The models answer semantic questions more accurately than numerical ones but still lag behind human experts, which limits their usefulness for clinical decision support.

Contribution

The paper introduces EBMQA, a large-scale medical question dataset, and compares LLM performance on numerical and semantic questions, highlighting their strengths and limitations in clinical knowledge application.

Findings

01. LLMs perform better on semantic questions than numerical ones.

02. Claude3 outperforms GPT-4 in numerical question accuracy.

03. Both LLMs are inferior to human experts in medical question answering.

Abstract

Clinical problem-solving requires processing both semantic medical knowledge, such as illness scripts, and numerical medical knowledge of diagnostic tests for evidence-based decision-making. Although large language models (LLMs) show promising results in many aspects of language-based clinical practice, their ability to generate non-language evidence-based answers to clinical questions is inherently limited by tokenization. Therefore, we evaluated LLMs' performance on two question types: numeric (correlating findings) and semantic (differentiating entities), while examining differences within and between LLMs across medical aspects and comparing their performance to humans. To generate straightforward multiple-choice questions and answers (QAs) based on evidence-based medicine (EBM), we used a comprehensive medical knowledge graph (encompassing data from more than 50,000 peer-reviewed articles) and created…
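The comparison the abstract describes, accuracy on numeric versus semantic multiple-choice questions, could be scored along the following lines. This is only a minimal sketch and not the authors' evaluation code: the record fields (`type`, `gold`, `predicted`), the function name `accuracy_by_type`, and the toy records are all hypothetical and are not taken from the EBMQA dataset.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Return {question_type: fraction of correct answers}.

    Each record is a dict with hypothetical keys:
      "type"      - question category, e.g. "numeric" or "semantic"
      "gold"      - the correct multiple-choice option
      "predicted" - the option chosen by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        qtype = rec["type"]
        total[qtype] += 1
        if rec["predicted"] == rec["gold"]:
            correct[qtype] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


# Toy records invented for illustration only; they are not results from the paper.
toy_records = [
    {"type": "numeric", "gold": "B", "predicted": "B"},
    {"type": "numeric", "gold": "C", "predicted": "A"},
    {"type": "semantic", "gold": "A", "predicted": "A"},
    {"type": "semantic", "gold": "D", "predicted": "D"},
]

scores = accuracy_by_type(toy_records)
for qtype, acc in sorted(scores.items()):
    print(f"{qtype}: {acc:.2f}")
```

Grouping by question type before dividing keeps the two accuracies independent, so a larger pool of one question type does not mask weak performance on the other.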

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Biomedical Text Mining and Ontologies