LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs

Yuhao Wu; Ming Shan Hee; Zhiqing Hu; Roy Ka-Wei Lee

arXiv:2409.02076 · cs.CL · January 24, 2025

TL;DR

LongGenBench is a new benchmark designed to evaluate large language models' ability to generate high-quality, long-form text that follows complex instructions, revealing current models' limitations in long text generation.

Contribution

The paper introduces LongGenBench, a comprehensive benchmark specifically targeting long-form text generation in LLMs, addressing a gap in existing evaluation methods.

Findings

01 Models perform poorly on long text generation tasks, especially at longer lengths.

02 Current LLMs struggle with maintaining coherence and adhering to instructions over extended sequences.

03 Benchmark reveals significant gaps in LLM capabilities for real-world long-form content creation.

Abstract

Current benchmarks like Needle-in-a-Haystack (NIAH), Ruler, and Needlebench focus on models' ability to understand long-context input sequences but fail to capture a critical dimension: the generation of high-quality long-form text. Applications such as design proposals, technical documentation, and creative writing rely on coherent, instruction-following outputs over extended sequences - a challenge that existing benchmarks do not adequately address. To fill this gap, we introduce LongGenBench, a novel benchmark designed to rigorously evaluate large language models' (LLMs) ability to generate long text while adhering to complex instructions. Through tasks requiring specific events or constraints within generated text, LongGenBench evaluates model performance across four distinct scenarios, three instruction types, and two generation-lengths (16K and 32K tokens). Our evaluation of ten…
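As a rough illustration of what checking for "specific events or constraints within generated text" could look like, the sketch below scores how many required events appear in the right section of a long output. The `Week N:` layout, the function names, and the substring matching are all assumptions made for illustration; they are not the benchmark's actual format or metric.

```python
import re


def split_entries(text):
    """Split generated text into {week_number: entry_text}.

    Assumes (hypothetically) that each entry starts with a 'Week N:' header.
    """
    parts = re.split(r"(?m)^Week (\d+):", text)
    # parts looks like [preamble, "1", body1, "2", body2, ...]
    return {int(parts[i]): parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}


def instruction_adherence(text, required):
    """Fraction of required (week -> phrase) constraints found in the matching entry.

    Uses case-insensitive substring matching, a deliberately simple stand-in
    for whatever check a real evaluation would use.
    """
    entries = split_entries(text)
    hits = sum(
        1
        for week, phrase in required.items()
        if phrase.lower() in entries.get(week, "").lower()
    )
    return hits / len(required) if required else 0.0


# Toy example: three entries were generated, but week 4 is missing entirely.
sample = (
    "Week 1: Started the garden.\n"
    "Week 2: Rain all week.\n"
    "Week 3: Attended a wedding.\n"
)
required = {1: "garden", 3: "wedding", 4: "marathon"}
score = instruction_adherence(sample, required)
```

In this toy case two of the three required events are found, so a model that stops generating early is penalized for every constraint it never reaches.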

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

- First benchmark focusing on long-form generation at test time
- The evaluation combines both the complexity of evaluation prompts and different scenarios
- First batch of results on 10 mainstream LLMs
- The paper is easy to follow

Weaknesses

- I am a little bit distracted from the main takeaways of the experimental studies, and not so convinced by the failure cases. See question 1.

I have other minor concerns regarding the experiment setup:

- There has been much research showing that the prompt format matters; what are your thoughts?
- Reasoning tasks are not well covered. As o1 seems to argue that a longer decoded length helps with reasoning on complex tasks, in your benchmark you might want to add an axis of reasoning ability cle

Reviewer 02 · Rating 8 · Confidence 3

Strengths

* This paper is overall clear and easy to understand.
* This proposed evaluation is novel, and the generation ability it benchmarks is not covered by previous metrics.

Weaknesses

* Some of the details are possibly missing or hard for readers to find -- see "Questions".
* In the proposed benchmark, long content is formed by piling up short answers to many sub-queries, while the sub-tasks are actually independent to a large extent. For example, given all the demands on one-year diaries, it should be easy for LLMs to write a diary if assigned a specific day of the year, while this benchmark just requires the LLM to generate 365 diaries all at once. In this case, the ch

Reviewer 03 · Rating 5 · Confidence 4

Strengths

1. Interesting task design, which can evaluate the long text generation ability of large models from a certain perspective.
2. The paper is well written.

Weaknesses

1. The types of task scenarios are relatively limited, and it is impossible to comprehensively evaluate the long text generation capabilities of large models.
2. The evaluation metrics seem to be customized according to the scenario.
3. Limited number of models evaluated.

Taxonomy

Topics: Scientific Computing and Data Management

Methods: Focus