RAmBLA: A Framework for Evaluating the Reliability of LLMs as Assistants in the Biomedical Domain

William James Bolton; Rafael Poyiadzi; Edward R. Morrell; Gabriela van Bergen Gonzalez Bueno; Lea Goetz

arXiv:2403.14578 · cs.LG · March 22, 2024 · 1 cites

TL;DR

This paper introduces the RAmBLA framework to evaluate the reliability of large language models as biomedical assistants, focusing on robustness, recall, and hallucination avoidance, through designed tasks and semantic similarity evaluation.

Contribution

The paper presents a novel framework, RAmBLA, for assessing LLM reliability in biomedicine, including specific criteria and evaluation methods tailored for real-world use cases.

Findings

01 Four state-of-the-art LLMs evaluated for biomedical assistance.

02 Identified key criteria: prompt robustness, high recall, minimal hallucinations.

03 Evaluation methodology using semantic similarity and an evaluator LLM.

Abstract

Large Language Models (LLMs) increasingly support applications in a wide range of domains, some with potential high societal impact such as biomedicine, yet their reliability in realistic use cases is under-researched. In this work we introduce the Reliability AssesMent for Biomedical LLM Assistants (RAmBLA) framework and evaluate whether four state-of-the-art foundation LLMs can serve as reliable assistants in the biomedical domain. We identify prompt robustness, high recall, and a lack of hallucinations as necessary criteria for this use case. We design shortform tasks and tasks requiring LLM freeform responses mimicking real-world user interactions. We evaluate LLM performance using semantic similarity with a ground truth response, through an evaluator LLM.
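To make the idea of scoring a freeform response against a ground truth response concrete, here is a minimal sketch. It is not the RAmBLA implementation: it substitutes a simple token-count cosine similarity for the semantic similarity and evaluator LLM described in the abstract, and the function names and the 0.7 threshold are hypothetical choices made only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between the token-count vectors of two strings.

    Stand-in for a real semantic similarity measure (assumption: the paper
    uses an evaluator LLM, which this does not reproduce).
    """
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    norm = norm_a * norm_b
    # Return 0.0 when either string has no tokens to avoid division by zero.
    return dot / norm if norm else 0.0


def is_correct(response, ground_truth, threshold=0.7):
    """Mark a response correct if it is close enough to the ground truth.

    The threshold value is a hypothetical default, not taken from the paper.
    """
    return cosine_similarity(response, ground_truth) >= threshold


if __name__ == "__main__":
    truth = "Metformin lowers blood glucose"
    print(is_correct("Metformin lowers blood glucose levels", truth))
    print(is_correct("Aspirin inhibits platelet aggregation", truth))
```

A token-overlap score like this misses paraphrases that share no words with the ground truth, which is one reason to prefer an embedding model or an evaluator LLM for the actual comparison.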

Code & Models

Repositories

gsk-ai/rambla (Official)

Taxonomy

Topics: Biomedical Text Mining and Ontologies