M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering

Anand Subramanian; Viktor Schlegel; Abhinav Ramesh Kashyap; Thanh-Tung Nguyen; Vijay Prakash Dwivedi; Stefan Winkler

arXiv:2406.03699·cs.CL·June 7, 2024

TL;DR

This paper introduces M-QALM, a comprehensive benchmark for evaluating large language models' ability to recall and integrate clinical knowledge through question answering, revealing key success factors and gaps in current models.

Contribution

It provides a large-scale empirical study across multiple datasets and models, identifying factors like instruction tuning that enhance clinical knowledge recall and comprehension.

Findings

01 Instruction tuning improves model performance.

02 Domain-adapted models may lack sufficient knowledge.

03 Fine-tuning on medical datasets shows promising generalization.

Abstract

There is active research on adapting Large Language Models (LLMs) to perform a variety of tasks in high-stakes domains such as healthcare. Despite the popularity of LLMs, it is poorly understood to what extent, and owing to which factors, they can recall relevant knowledge and combine it with presented information in the clinical and biomedical domain: a fundamental prerequisite for success on downstream tasks. Addressing this gap, we use Multiple Choice and Abstractive Question Answering to conduct a large-scale empirical study on 22 datasets in three generalist and three specialist biomedical sub-domains. Our multifaceted analysis of the performance of 15 LLMs, further broken down by sub-domain, source of knowledge and model architecture, uncovers success factors such as instruction tuning that lead to improved recall and comprehension. We further show that while recently…

Code & Models

Repositories

anand-subu/m-qalm
Official

Videos

M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering

Taxonomy

Topics: Topic Modeling · Text Readability and Simplification