Marathon: A Race Through the Realm of Long Context with Large Language Models

Lei Zhang; Yunshui Li; Ziqiang Liu; Jiaxi Yang; Junhao Liu; Longze Chen; Run Luo; Min Yang

arXiv:2312.09542 · cs.CL · June 27, 2024 · 2 cites

TL;DR

Marathon introduces a new multiple-choice benchmark to accurately and efficiently evaluate large language models' comprehension and reasoning abilities over long contexts, addressing limitations of previous benchmarks.

Contribution

It presents a novel evaluation benchmark with a multiple-choice format that improves accuracy, speed, and fairness in assessing long-context understanding of LLMs.

Findings

01. Effective evaluation of LLMs' long-context comprehension

02. Benchmark outperforms traditional F1-based assessments

03. Supports assessment of optimization strategies for long-context generation

Abstract

With the advancement of large language models (LLMs) and the expansion of their context windows, existing long-context benchmarks fall short in effectively evaluating the models' comprehension and reasoning abilities in extended texts. Moreover, conventional benchmarks relying on F1 metrics often inaccurately score responses: they may undervalue correct answers that differ from the reference responses and overvalue incorrect ones that resemble the reference texts. In response to these limitations, we introduce Marathon, a novel evaluation benchmark that adopts a multiple-choice question format. It is specifically designed to overcome the constraints of previous benchmarks and provide a rapid, precise, and unbiased appraisal of the long-context comprehension skills of large language models. We conducted comprehensive evaluations on the Marathon benchmark with a range of state-of-the-art…
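The scoring problem described in the abstract can be illustrated with a minimal sketch. It assumes a SQuAD-style token-overlap F1, which may differ from the exact metric used by the benchmarks the paper criticizes; the function names `token_f1` and `choice_accuracy` and the example sentences are hypothetical and not taken from the paper or its repository.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings (assumed SQuAD-style metric)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def choice_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of multiple-choice questions answered with the gold option."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Made-up example: the reference answer to a hypothetical question.
reference = "the treaty was signed in 1848"
# Factually correct, but worded differently from the reference.
correct_paraphrase = "both sides agreed to it in 1848"
# Factually wrong (wrong year), but nearly identical in wording.
wrong_near_copy = "the treaty was signed in 1852"

f1_correct = token_f1(correct_paraphrase, reference)
f1_wrong = token_f1(wrong_near_copy, reference)
print(f"F1 of correct paraphrase: {f1_correct:.3f}")
print(f"F1 of wrong near-copy:    {f1_wrong:.3f}")

# With a multiple-choice format, scoring reduces to comparing option labels.
print(choice_accuracy(["B", "C"], ["B", "A"]))
```

In this sketch the wrong answer scores higher than the correct paraphrase because F1 rewards surface overlap, whereas comparing option labels depends only on whether the chosen option is the gold one.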

Code & Models

Repositories

hambaobao/marathon
PyTorch · Official

Datasets

Hambaobao/Marathon
dataset · 44 dl

Videos

Marathon: A Race Through the Realm of Long Context with Large Language Models

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications