One Thousand and One Pairs: A "novel" challenge for long-context language models
Marzena Karpinska, Katherine Thai, Kyle Lo, Tanya Goyal, Mohit Iyyer

TL;DR
This paper introduces NoCha, a challenging dataset for long-context language models that tests their ability to retrieve, synthesize, and reason over entire books, revealing significant gaps in current model capabilities.
Contribution
The paper presents NoCha, a dataset of 1,001 minimally different pairs of true and false claims about 67 recently published English fiction books, designed to evaluate global reasoning in long-context LLMs, together with an analysis of model performance and limitations.
Findings
No open-weight model performs above random chance, despite strong performance on synthetic benchmarks.
GPT-4o achieves 55.8% accuracy, the highest among evaluated models.
Models struggle with speculative fiction and generate inaccurate explanations.
Abstract
Synthetic long-context LLM benchmarks (e.g., "needle-in-the-haystack") test only surface-level retrieval capabilities, but how well can long-context LLMs retrieve, synthesize, and reason over information across book-length inputs? We address this question by creating NoCha, a dataset of 1,001 minimally different pairs of true and false claims about 67 recently-published English fictional books, written by human readers of those books. In contrast to existing long-context benchmarks, our annotators confirm that the largest share of pairs in NoCha require global reasoning over the entire book to verify. Our experiments show that while human readers easily perform this task, it is enormously challenging for all ten long-context LLMs that we evaluate: no open-weight model performs above random chance (despite their strong performance on synthetic benchmarks), while GPT-4o achieves the…
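One way to score minimally different claim pairs is to count a pair as correct only when the model labels both the true and the false claim correctly. The sketch below illustrates that idea next to plain per-claim accuracy. It is an assumption for illustration: the abstract as quoted here does not spell out the metric, and the names `PairPrediction`, `pair_accuracy`, and `claim_accuracy` are hypothetical, not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PairPrediction:
    """Model verdicts for one claim pair (hypothetical structure)."""
    verdict_on_true_claim: bool   # model's answer for the claim that is true
    verdict_on_false_claim: bool  # model's answer for the claim that is false


def pair_accuracy(preds: List[PairPrediction]) -> float:
    """Fraction of pairs where BOTH claims are labeled correctly (assumed metric)."""
    if not preds:
        return 0.0
    both_right = sum(
        1 for p in preds
        if p.verdict_on_true_claim and not p.verdict_on_false_claim
    )
    return both_right / len(preds)


def claim_accuracy(preds: List[PairPrediction]) -> float:
    """Fraction of individual claims labeled correctly, ignoring pairing."""
    if not preds:
        return 0.0
    right = sum(
        int(p.verdict_on_true_claim) + int(not p.verdict_on_false_claim)
        for p in preds
    )
    return right / (2 * len(preds))


# Four made-up pairs: one fully correct, two half correct, one fully wrong.
example = [
    PairPrediction(True, False),
    PairPrediction(True, True),
    PairPrediction(False, False),
    PairPrediction(False, True),
]
print(pair_accuracy(example))   # 0.25
print(claim_accuracy(example))  # 0.5
```

Under this assumed scoring, a model that answers "true" to every claim reaches 50% per-claim accuracy but 0% pair accuracy, which is why pairing the claims makes it harder to score well by guessing.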
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
