RAmBLA: A Framework for Evaluating the Reliability of LLMs as Assistants in the Biomedical Domain
William James Bolton, Rafael Poyiadzi, Edward R. Morrell, Gabriela van Bergen Gonzalez Bueno, Lea Goetz

TL;DR
This paper introduces the RAmBLA framework for evaluating the reliability of large language models as biomedical assistants. It focuses on prompt robustness, high recall, and avoidance of hallucinations, using shortform and freeform tasks scored by semantic similarity to a ground truth response.
Contribution
The paper presents a novel framework, RAmBLA, for assessing LLM reliability in biomedicine, including specific criteria and evaluation methods tailored for real-world use cases.
Findings
Four state-of-the-art foundation LLMs are evaluated as biomedical assistants.
Three criteria are identified as necessary for this use case: prompt robustness, high recall, and a lack of hallucinations.
Performance is scored by semantic similarity to a ground truth response, as judged by an evaluator LLM.
Abstract
Large Language Models (LLMs) increasingly support applications in a wide range of domains, some with potential high societal impact such as biomedicine, yet their reliability in realistic use cases is under-researched. In this work we introduce the Reliability AssesMent for Biomedical LLM Assistants (RAmBLA) framework and evaluate whether four state-of-the-art foundation LLMs can serve as reliable assistants in the biomedical domain. We identify prompt robustness, high recall, and a lack of hallucinations as necessary criteria for this use case. We design shortform tasks and tasks requiring LLM freeform responses mimicking real-world user interactions. We evaluate LLM performance using semantic similarity with a ground truth response, through an evaluator LLM.
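The scoring step described in the abstract (comparing each model response to a ground truth answer through an evaluator) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `toy_evaluator`, `score_responses`, and the 0.5 threshold are hypothetical, and the token-overlap check is only a stand-in for the evaluator LLM the paper uses, not the authors' implementation.

```python
from typing import Callable, Sequence

# Hypothetical interface: an evaluator receives (response, ground_truth) and
# returns True when it judges the two to be semantically equivalent.
Evaluator = Callable[[str, str], bool]


def toy_evaluator(response: str, ground_truth: str, threshold: float = 0.5) -> bool:
    """Stand-in for an evaluator LLM (assumption, not the paper's method).

    Uses Jaccard overlap of lowercase whitespace tokens as a crude proxy
    for semantic similarity.
    """
    a = set(response.lower().split())
    b = set(ground_truth.lower().split())
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold


def score_responses(
    responses: Sequence[str],
    ground_truths: Sequence[str],
    evaluator: Evaluator = toy_evaluator,
) -> float:
    """Return the fraction of responses the evaluator judges equivalent."""
    if len(responses) != len(ground_truths):
        raise ValueError("responses and ground_truths must have equal length")
    if not responses:
        return 0.0
    matches = sum(evaluator(r, g) for r, g in zip(responses, ground_truths))
    return matches / len(responses)


if __name__ == "__main__":
    responses = [
        "aspirin inhibits cox enzymes",
        "the gene is located on chromosome 7",
    ]
    ground_truths = [
        "aspirin inhibits cox enzymes",
        "insulin lowers blood glucose",
    ]
    print(score_responses(responses, ground_truths))
```

In practice the `evaluator` argument would wrap a call to a real evaluator model; keeping it as a plain callable lets the same scoring loop be reused across tasks and across paraphrased prompts when checking robustness.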
Taxonomy
Topics: Biomedical Text Mining and Ontologies
