Marathon: A Race Through the Realm of Long Context with Large Language Models
Lei Zhang, Yunshui Li, Ziqiang Liu, Jiaxi Yang, Junhao Liu, Longze Chen, Run Luo, Min Yang

TL;DR
Marathon introduces a new multiple-choice benchmark to accurately and efficiently evaluate large language models' comprehension and reasoning abilities over long contexts, addressing limitations of previous benchmarks.
Contribution
It presents a novel evaluation benchmark with a multiple-choice format that improves accuracy, speed, and fairness in assessing long-context understanding of LLMs.
Findings
Effective evaluation of LLMs' long-context comprehension
Benchmark outperforms traditional F1-based assessments
Supports assessment of optimization strategies for long-context generation
Abstract
With the advancement of large language models (LLMs) and the expansion of their context windows, existing long-context benchmarks fall short in effectively evaluating the models' comprehension and reasoning abilities in extended texts. Moreover, conventional benchmarks relying on F1 metrics often inaccurately score responses: they may undervalue correct answers that differ from the reference responses and overvalue incorrect ones that resemble the reference texts. In response to these limitations, we introduce Marathon, a novel evaluation benchmark that adopts a multiple-choice question format. It is specifically designed to overcome the constraints of previous benchmarks and provide a rapid, precise, and unbiased appraisal of the long-context comprehension skills of large language models. We conducted comprehensive evaluations on the Marathon benchmark with a range of state-of-the-art…
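To make the abstract's point about F1 scoring concrete, here is a minimal sketch in Python. It assumes a simple whitespace-tokenized, SQuAD-style token F1, which may differ from the exact metric used by the benchmarks the paper criticizes. The function names (`token_f1`, `choice_accuracy`) and the example sentences are hypothetical illustrations, not taken from the paper or from Marathon's code.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings (assumed whitespace tokenization)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def choice_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of multiple-choice questions where the chosen label is correct."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical example: the reference says the meeting moved to Friday.
reference = "the meeting was moved to friday"
correct_paraphrase = "it now takes place at the end of the week"  # right, different wording
wrong_but_similar = "the meeting was moved to monday"  # wrong, near-identical wording

f1_correct = token_f1(correct_paraphrase, reference)
f1_wrong = token_f1(wrong_but_similar, reference)
print(f"F1 of correct paraphrase:   {f1_correct:.3f}")
print(f"F1 of wrong similar answer: {f1_wrong:.3f}")

# With option labels, scoring is an exact label match and wording plays no role.
print(f"Accuracy: {choice_accuracy(['A', 'C', 'B'], ['A', 'C', 'D']):.3f}")
```

In this sketch the wrong answer scores higher on F1 than the correct paraphrase because it shares almost every token with the reference, which is the failure mode the abstract describes. A multiple-choice format sidesteps it by comparing option labels rather than surface text.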
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
