ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding
Uri Shaham, Maor Ivgi, Avia Efrat, Jonathan Berant, Omer Levy

TL;DR
ZeroSCROLLS is a zero-shot benchmark for evaluating how well large language models understand long texts across diverse tasks. It provides only test and small validation sets, with no training data.
Contribution
The paper introduces ZeroSCROLLS, which adapts six tasks from the SCROLLS benchmark and adds four new datasets, including two information-fusing tasks. It also provides a comprehensive evaluation of both open-source and closed large language models.
Findings
Claude outperforms ChatGPT on ZeroSCROLLS.
GPT-4 achieves the highest average score among the evaluated models.
Models struggle to pass the naive baseline on aggregation tasks, leaving room for improvement.
Abstract
We introduce ZeroSCROLLS, a zero-shot benchmark for natural language understanding over long texts, which contains only test and small validation sets, without training data. We adapt six tasks from the SCROLLS benchmark, and add four new datasets, including two novel information fusing tasks, such as aggregating the percentage of positive reviews. Using ZeroSCROLLS, we conduct a comprehensive evaluation of both open-source and closed large language models, finding that Claude outperforms ChatGPT, and that GPT-4 achieves the highest average score. However, there is still room for improvement on multiple open challenges in ZeroSCROLLS, such as aggregation tasks, where models struggle to pass the naive baseline. As the state of the art is a moving target, we invite researchers to evaluate their ideas on the live ZeroSCROLLS leaderboard.
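As a rough illustration of the aggregation task mentioned in the abstract (estimating the percentage of positive reviews), the following minimal sketch computes the gold percentage from per-review labels and compares a prediction against a constant guess. The function names, the constant guess of 50, and the use of absolute error are all assumptions made for illustration; the abstract does not specify the benchmark's naive baseline or its scoring metric.

```python
from typing import Sequence


def percent_positive(labels: Sequence[bool]) -> float:
    """Return the percentage of reviews labeled positive (True)."""
    if not labels:
        raise ValueError("need at least one review label")
    return 100.0 * sum(labels) / len(labels)


def absolute_error(predicted: float, gold: float) -> float:
    """Stand-in score (an assumption): distance between prediction and gold."""
    return abs(predicted - gold)


# Hypothetical naive baseline: always guess 50%, ignoring the reviews.
NAIVE_GUESS = 50.0

if __name__ == "__main__":
    labels = [True, True, False, True]  # toy data: 3 of 4 reviews positive
    gold = percent_positive(labels)
    model_prediction = 70.0  # hypothetical model output
    print("model error:", absolute_error(model_prediction, gold))
    print("naive error:", absolute_error(NAIVE_GUESS, gold))
```

Under this toy setup, a model only "passes the naive baseline" when its error is smaller than that of the constant guess.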
Taxonomy
Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Test · Absolute Position Encodings · Softmax · Layer Normalization · Byte Pair Encoding · Dropout · Linear Layer · Label Smoothing
