LongHealth: A Question Answering Benchmark with Long Clinical Documents
Lisa Adams, Felix Busch, Tianyu Han, Jean-Baptiste Excoffier, Matthieu Ortala, Alexander Löser, Hugo JWL. Aerts, Jakob Nikolas Kather, Daniel Truhn, Keno Bressem

TL;DR
LongHealth introduces a new benchmark with lengthy clinical documents to evaluate LLMs' ability to process and interpret complex healthcare data, revealing current limitations and areas for improvement.
Contribution
The paper presents the LongHealth benchmark, a novel dataset with long clinical cases and questions, to assess LLMs' performance on real-world healthcare tasks.
Findings
LLMs perform best on information retrieval tasks.
All models struggle with identifying missing information.
Current models are not yet reliable for clinical use.
Abstract
Background: Recent advancements in large language models (LLMs) offer potential benefits in healthcare, particularly in processing extensive patient records. However, existing benchmarks do not fully assess LLMs' capability in handling real-world, lengthy clinical data. Methods: We present the LongHealth benchmark, comprising 20 detailed fictional patient cases across various diseases, with each case containing 5,090 to 6,754 words. The benchmark poses 400 multiple-choice questions in three categories (information extraction, negation, and sorting), requiring LLMs to extract and interpret information from large clinical documents. Results: We evaluated nine open-source LLMs with a context length of at least 16,000 tokens and also included OpenAI's proprietary and cost-efficient GPT-3.5 Turbo for comparison. The highest accuracy was observed for Mixtral-8x7B-Instruct-v0.1,…
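As a rough illustration of how a multiple-choice benchmark with several question categories can be scored, the sketch below computes accuracy per category. It is not the authors' evaluation code: the record layout (category, predicted option, correct option), the option letters, and the function name accuracy_by_category are all assumptions, and the four example records are made-up toy data rather than benchmark results.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: fraction of correctly answered questions}.

    `records` is an iterable of (category, predicted_option, correct_option)
    tuples. This layout is hypothetical, chosen only for the illustration.
    """
    totals = defaultdict(int)  # questions seen per category
    hits = defaultdict(int)    # correct answers per category
    for category, predicted, correct in records:
        totals[category] += 1
        if predicted == correct:
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}


# Toy records (invented for this example, not taken from the benchmark).
example = [
    ("information extraction", "A", "A"),
    ("information extraction", "C", "B"),
    ("negation", "E", "E"),
    ("sorting", "B", "D"),
]
scores = accuracy_by_category(example)
```

Reporting accuracy separately for each category, rather than as one pooled number, makes it possible to see whether a model that retrieves information well still fails on negation or sorting questions.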
Taxonomy
Topics: Machine Learning in Healthcare · Topic Modeling · Artificial Intelligence in Healthcare and Education
Methods: Attention Is All You Need · Cosine Annealing · Linear Warmup With Cosine Annealing · Weight Decay · Linear Layer · Dropout · Attention Dropout
