LongHealth: A Question Answering Benchmark with Long Clinical Documents

Lisa Adams; Felix Busch; Tianyu Han; Jean-Baptiste Excoffier; Matthieu Ortala; Alexander Löser; Hugo JWL Aerts; Jakob Nikolas Kather; Daniel Truhn; Keno Bressem

arXiv:2401.14490·cs.CL·January 29, 2024·2 cites

TL;DR

LongHealth introduces a new benchmark with lengthy clinical documents to evaluate LLMs' ability to process and interpret complex healthcare data, revealing current limitations and areas for improvement.

Contribution

The paper presents the LongHealth benchmark, a novel dataset with long clinical cases and questions, to assess LLMs' performance on real-world healthcare tasks.

Findings

01 LLMs perform best on information retrieval tasks.

02 All models struggle with identifying missing information.

03 Current models are not yet reliable for clinical use.

Abstract

Background: Recent advancements in large language models (LLMs) offer potential benefits in healthcare, particularly in processing extensive patient records. However, existing benchmarks do not fully assess LLMs' capability in handling real-world, lengthy clinical data. Methods: We present the LongHealth benchmark, comprising 20 detailed fictional patient cases across various diseases, with each case containing 5,090 to 6,754 words. The benchmark poses 400 multiple-choice questions in three categories (information extraction, negation, and sorting) that require LLMs to extract and interpret information from large clinical documents. Results: We evaluated nine open-source LLMs with a minimum context length of 16,000 tokens and also included OpenAI's proprietary and cost-efficient GPT-3.5 Turbo for comparison. The highest accuracy was observed for Mixtral-8x7B-Instruct-v0.1,…
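
As a rough illustration of how a multiple-choice benchmark with several question categories can be scored, the sketch below computes accuracy per category. It is not taken from the LongHealth repository: the record fields (`id`, `category`, `answer`), the function name `accuracy_by_category`, and the toy data are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_category(questions, predictions):
    """Return {category: accuracy} for multiple-choice questions.

    questions:   list of dicts with hypothetical keys 'id', 'category', 'answer'
    predictions: dict mapping question id -> option chosen by the model
    A missing prediction counts as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        total[q["category"]] += 1
        if predictions.get(q["id"]) == q["answer"]:
            correct[q["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}


# Toy data for illustration only; these are not benchmark questions or results.
questions = [
    {"id": "q1", "category": "information extraction", "answer": "B"},
    {"id": "q2", "category": "negation", "answer": "E"},
    {"id": "q3", "category": "sorting", "answer": "A"},
    {"id": "q4", "category": "negation", "answer": "C"},
]
predictions = {"q1": "B", "q2": "A", "q3": "A", "q4": "C"}

scores = accuracy_by_category(questions, predictions)
```

Reporting accuracy per category rather than as one pooled number makes it visible when a model handles extraction well but does poorly on negation or sorting.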

Code & Models

Repositories

kbressem/longhealth
Official

Taxonomy

Topics: Machine Learning in Healthcare · Topic Modeling · Artificial Intelligence in Healthcare and Education

Methods: Attention Is All You Need · Cosine Annealing · Linear Warmup With Cosine Annealing · Weight Decay · Linear Layer · Dropout · Attention Dropout