Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks

Jack Gallifant; Shan Chen; Pedro Moreira; Nikolaj Munch; Mingye Gao; Jackson Pond; Leo Anthony Celi; Hugo Aerts; Thomas Hartvigsen; Danielle Bitterman

arXiv:2406.12066·cs.CL·June 21, 2024·1 cites

TL;DR

This paper introduces RABBITS, a robustness dataset showing that language models' accuracy on biomedical benchmarks drops by 1-10% when brand and generic drug names are swapped, which highlights their fragility to surface-level wording.

Contribution

The paper creates RABBITS, a new dataset for evaluating LLM robustness to drug name variations. It demonstrates performance drops on MedQA and MedMCQA and identifies test data contamination in widely used pre-training datasets as a potential source of the fragility.

Findings

01 Performance drops by 1-10% after brand and generic drug names are swapped.

02 Test data contamination in widely used pre-training datasets is a potential source of this fragility.

03 Open-source and API-based LLMs show similar fragility.

Abstract

Medical knowledge is context-dependent and requires consistent reasoning across various natural language expressions of semantically equivalent phrases. This is particularly crucial for drug names, where patients often use brand names like Advil or Tylenol instead of their generic equivalents. To study this, we create a new robustness dataset, RABBITS, to evaluate performance differences on medical benchmarks after swapping brand and generic drug names using physician expert annotations. We assess both open-source and API-based LLMs on MedQA and MedMCQA, revealing a consistent performance drop ranging from 1-10%. Furthermore, we identify a potential source of this fragility as the contamination of test data in widely used pre-training datasets. All code is accessible at https://github.com/BittermanLab/RABBITS, and a HuggingFace leaderboard is available at…
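The swap described in the abstract can be illustrated with a minimal sketch. This is not the RABBITS pipeline: the two-entry mapping, the function names `swap_drug_names` and `accuracy_drop`, and the regex-based matching are all assumptions made for illustration, whereas the paper relies on physician expert annotations to decide which names are swapped.

```python
import re

# Hypothetical brand -> generic pairs for illustration only. RABBITS uses
# physician-annotated mappings, which are not reproduced here.
BRAND_TO_GENERIC = {
    "Advil": "ibuprofen",
    "Tylenol": "acetaminophen",
}


def swap_drug_names(text, mapping):
    """Replace whole-word, case-insensitive matches of each key with its value."""
    if not mapping:
        return text
    lookup = {name.lower(): target for name, target in mapping.items()}
    # Longest names first so a multi-word name is matched before a shorter
    # name that happens to be a prefix of it.
    names = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    # A single pass over the text means a replacement is never swapped again.
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)


def accuracy_drop(original_correct, swapped_correct):
    """Accuracy on the original items minus accuracy on the swapped items,
    in percentage points. Both arguments are equal-length lists of booleans."""
    if not original_correct or len(original_correct) != len(swapped_correct):
        raise ValueError("expected two non-empty lists of equal length")
    n = len(original_correct)
    return 100.0 * (sum(original_correct) - sum(swapped_correct)) / n


question = "A patient took Advil and tylenol before admission."
swapped = swap_drug_names(question, BRAND_TO_GENERIC)

# The reverse direction (generic -> brand) uses the inverted mapping.
GENERIC_TO_BRAND = {generic: brand for brand, generic in BRAND_TO_GENERIC.items()}
```

Here `swapped` holds the question with both brand names replaced by their generic equivalents. Scoring a model on the original and the swapped questions and passing the per-item results to `accuracy_drop` gives the kind of gap the abstract reports.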

Code & Models

Repositories

bittermanlab/rabbits
Official

Datasets

AIM-Harvard/rabbit_b4bqa
dataset· 4 dl

Videos

Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks· underline

Taxonomy

Topics: Genomics and Rare Diseases · Biomedical Text Mining and Ontologies