MedOdyssey: A Medical Domain Benchmark for Long Context Evaluation Up to 200K Tokens
Yongqi Fan, Hongli Sun, Kui Xue, Xiaofan Zhang, Shaoting Zhang, Tong Ruan

TL;DR
MedOdyssey introduces a comprehensive long-context benchmark for medical LLMs, covering up to 200K tokens, to evaluate their performance on complex, domain-specific tasks requiring extensive context understanding.
Contribution
This paper presents the first medical long-context benchmark with multiple difficulty levels and specialized tasks, addressing the unique challenges of medical text processing.
Findings
LLMs still struggle with very long medical contexts
The benchmark reveals gaps in current LLM capabilities for medical applications
Professional medical expertise remains essential for accurate performance
Abstract
Numerous advanced Large Language Models (LLMs) now support context lengths up to 128K, and some extend to 200K. Some benchmarks in the generic domain have also followed up on evaluating long-context capabilities. In the medical domain, tasks are distinctive due to the unique contexts and need for domain expertise, necessitating further evaluation. However, despite the frequent presence of long texts in medical scenarios, evaluation benchmarks of long-context capabilities for LLMs in this field are still rare. In this paper, we propose MedOdyssey, the first medical long-context benchmark with seven length levels ranging from 4K to 200K tokens. MedOdyssey consists of two primary components: the medical-context "needles in a haystack" task and a series of tasks specific to medical applications, together comprising 10 datasets. The first component includes challenges such as…
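As a rough illustration of the "needles in a haystack" idea mentioned in the abstract, the sketch below builds a long context from filler sentences and hides one target sentence at a chosen relative depth. It is a minimal sketch, not the authors' construction: the function names `build_haystack` and `needle_found`, the use of whitespace-separated words as a stand-in for tokens, and the synthetic needle sentence are all assumptions made for illustration.

```python
import random


def build_haystack(filler_sentences, needle, target_len, depth, seed=0):
    """Build a context of at least `target_len` words with `needle` inserted.

    Hypothetical helper: length is counted in whitespace-separated words,
    which only approximates a real tokenizer's token count.
    `depth` is the relative insert position (0.0 = start, 1.0 = end).
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    rng = random.Random(seed)
    sentences, n_words = [], 0
    # Keep sampling filler sentences until the length budget is reached.
    while n_words < target_len:
        sentence = rng.choice(filler_sentences)
        sentences.append(sentence)
        n_words += len(sentence.split())
    # Insert the needle at the sentence boundary closest to `depth`.
    sentences.insert(round(depth * len(sentences)), needle)
    return " ".join(sentences)


def needle_found(model_answer, expected):
    """Very loose scoring rule (an assumption): case-insensitive substring match."""
    return expected.lower() in model_answer.lower()


# Synthetic filler and needle, invented purely for this example.
filler = [
    "The patient was admitted for routine observation.",
    "Vital signs were recorded every four hours.",
    "No changes were made to the care plan.",
]
needle = "The code word for ward seven is heliotrope."
context = build_haystack(filler, needle, target_len=200, depth=0.5)
```

In an evaluation loop, `target_len` would be swept over the benchmark's length levels and `depth` over several positions, with the model asked to recall the needle from `context` and its answer scored by something like `needle_found`.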
Taxonomy
Topics: Machine Learning in Healthcare · Radiomics and Machine Learning in Medical Imaging · COVID-19 diagnosis using AI
