MedOdyssey: A Medical Domain Benchmark for Long Context Evaluation Up to 200K Tokens

Yongqi Fan; Hongli Sun; Kui Xue; Xiaofan Zhang; Shaoting Zhang; Tong Ruan

arXiv:2406.15019·cs.CL·June 24, 2024

TL;DR

MedOdyssey introduces a comprehensive long-context benchmark for medical LLMs, covering up to 200K tokens, to evaluate their performance on complex, domain-specific tasks requiring extensive context understanding.

Contribution

This paper presents the first medical long-context benchmark with multiple difficulty levels and specialized tasks, addressing the unique challenges of medical text processing.

Findings

01. LLMs still struggle with very long medical contexts

02. The benchmark reveals gaps in current LLM capabilities for medical applications

03. Professional medical expertise remains essential for accurate performance

Abstract

Numerous advanced Large Language Models (LLMs) now support context lengths up to 128K, and some extend to 200K. Some benchmarks in the generic domain have also followed up on evaluating long-context capabilities. In the medical domain, tasks are distinctive due to the unique contexts and need for domain expertise, necessitating further evaluation. However, despite the frequent presence of long texts in medical scenarios, evaluation benchmarks of long-context capabilities for LLMs in this field are still rare. In this paper, we propose MedOdyssey, the first medical long-context benchmark with seven length levels ranging from 4K to 200K tokens. MedOdyssey consists of two primary components: the medical-context "needles in a haystack" task and a series of tasks specific to medical applications, together comprising 10 datasets. The first component includes challenges such as…
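To make the "needles in a haystack" setup concrete, here is a minimal Python sketch of how such a probe is commonly assembled: a short fact (the needle) is placed at a chosen relative depth inside filler text padded to a target length. This is an illustration only, not the authors' construction. The function names `insert_needle` and `build_probe` are hypothetical, and whitespace-split words stand in for real tokenizer tokens.

```python
def insert_needle(haystack_tokens, needle_tokens, depth):
    """Return a new token list with the needle placed at a relative depth.

    depth=0.0 puts the needle at the start, depth=1.0 at the end.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    pos = int(len(haystack_tokens) * depth)
    return haystack_tokens[:pos] + needle_tokens + haystack_tokens[pos:]


def build_probe(filler_tokens, needle_tokens, target_len, depth):
    """Build a context of exactly target_len tokens containing the needle.

    Assumption: filler is repeated and truncated to fill the budget; a real
    benchmark would draw on a large corpus of domain text instead.
    """
    if not filler_tokens:
        raise ValueError("filler_tokens must not be empty")
    budget = target_len - len(needle_tokens)
    if budget < 0:
        raise ValueError("target_len is shorter than the needle")
    reps = budget // len(filler_tokens) + 1
    haystack = (filler_tokens * reps)[:budget]
    return insert_needle(haystack, needle_tokens, depth)


if __name__ == "__main__":
    # Whitespace "tokens" are a stand-in for a real tokenizer (assumption).
    filler = "the patient was stable".split()
    needle = ["NEEDLE", "fact"]
    probe = build_probe(filler, needle, target_len=20, depth=0.5)
    print(len(probe), probe.index("NEEDLE"))
```

Sweeping `target_len` over the benchmark's length levels and `depth` over several positions would produce a grid of probes. Scoring then checks whether the model's answer recovers the needle.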

Code & Models

Repositories

johnny-fans/medodyssey · Official

Videos

MedOdyssey: A Medical Domain Benchmark for Long Context Evaluation Up to 200K Tokens · Underline

Taxonomy

Topics: Machine Learning in Healthcare · Radiomics and Machine Learning in Medical Imaging · COVID-19 diagnosis using AI