LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation

Xi Ye; Fangcong Yin; Yinghui He; Joie Zhang; Howard Yen; Tianyu Gao; Greg Durrett; Danqi Chen

arXiv:2501.05414 · cs.CL · September 30, 2025

TL;DR

LongProc is a new benchmark designed to evaluate long-context language models on complex procedural tasks requiring dispersed information synthesis and long-form generation, revealing current models' limitations in coherence and scalability.

Contribution

We introduce LongProc, a benchmark of six diverse procedural generation tasks for assessing long-context language models' ability to follow detailed procedures and generate structured, long-form outputs.

Findings

01. Reasoning models outperform others in long-form generation.

02. Open-weight models struggle with 2K- and 8K-token tasks.

03. Models show difficulty maintaining long-range coherence.

Abstract

Existing benchmarks for evaluating long-context language models (LCLMs) primarily focus on long-context recall, requiring models to produce short responses based on a few critical snippets while processing thousands of irrelevant tokens. We introduce LongProc (Long Procedural Generation), a new benchmark that requires both the integration of highly dispersed information and long-form generation. LongProc consists of six diverse procedural generation tasks, such as extracting structured information from HTML pages into a TSV format and executing complex search procedures to create travel plans. These tasks challenge LCLMs by testing their ability to follow detailed procedural instructions, synthesize and reason over dispersed information, and generate structured, long-form outputs (up to 8K tokens). Furthermore, as these tasks adhere to deterministic procedures and yield structured…

Code & Models

Datasets

PrincetonPLI/LongProc
dataset · 149 dl

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
