One Thousand and One Pairs: A "novel" challenge for long-context language models

Marzena Karpinska; Katherine Thai; Kyle Lo; Tanya Goyal; Mohit Iyyer

arXiv:2406.16264·cs.CL·October 23, 2024

TL;DR

This paper introduces NoCha, a challenging dataset for long-context language models that tests their ability to retrieve, synthesize, and reason over entire books, revealing significant gaps in current model capabilities.

Contribution

The paper presents NoCha, a novel dataset with 1,001 pairs of true and false claims about books, designed to evaluate global reasoning in long-context LLMs, and provides analysis of model performance and limitations.

Findings

01 Models perform near chance on global reasoning tasks.

02 GPT-4o achieves 55.8% accuracy, the highest among evaluated models.

03 Models struggle with speculative fiction and generate inaccurate explanations.
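
To make the accuracy and chance figures above easier to interpret, here is a minimal Python sketch of how a dataset of true/false claim pairs might be scored. The scoring rule is not spelled out in this summary, so the sketch assumes pair-level scoring: a pair counts as correct only if the model accepts the true claim and rejects the false one. The names `ClaimPair`, `pair_accuracy`, and `claim_accuracy` are hypothetical and are not taken from the NoCha repository.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ClaimPair:
    """Model verdicts for one minimally different pair (hypothetical structure)."""
    verdict_on_true_claim: bool   # what the model answered for the true claim
    verdict_on_false_claim: bool  # what the model answered for the false claim


def pair_accuracy(pairs: Sequence[ClaimPair]) -> float:
    """Fraction of pairs where BOTH claims are labeled correctly (assumed rule)."""
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(
        1
        for p in pairs
        if p.verdict_on_true_claim and not p.verdict_on_false_claim
    )
    return correct / len(pairs)


def claim_accuracy(pairs: Sequence[ClaimPair]) -> float:
    """Fraction of individual claims labeled correctly, ignoring pairing."""
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(
        int(p.verdict_on_true_claim) + int(not p.verdict_on_false_claim)
        for p in pairs
    )
    return correct / (2 * len(pairs))


# The four possible verdict combinations, each occurring once, mimic a
# model that guesses each claim independently and uniformly at random.
example = [
    ClaimPair(True, False),   # both correct
    ClaimPair(True, True),    # accepts the false claim
    ClaimPair(False, False),  # rejects the true claim
    ClaimPair(False, True),   # both wrong
]
print(pair_accuracy(example), claim_accuracy(example))
```

Under this assumed rule, independent random guessing gets only one of the four verdict combinations right, so the chance baseline is 25% on pairs even though it is 50% on individual claims. Reported accuracies should be read against whichever baseline matches the scoring actually used.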

Abstract

Synthetic long-context LLM benchmarks (e.g., "needle-in-the-haystack") test only surface-level retrieval capabilities, but how well can long-context LLMs retrieve, synthesize, and reason over information across book-length inputs? We address this question by creating NoCha, a dataset of 1,001 minimally different pairs of true and false claims about 67 recently-published English fictional books, written by human readers of those books. In contrast to existing long-context benchmarks, our annotators confirm that the largest share of pairs in NoCha require global reasoning over the entire book to verify. Our experiments show that while human readers easily perform this task, it is enormously challenging for all ten long-context LLMs that we evaluate: no open-weight model performs above random chance (despite their strong performance on synthetic benchmarks), while GPT-4o achieves the…

Code & Models

Repositories

marzenakrp/nocha (official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling