LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs
Yuhao Wu, Ming Shan Hee, Zhiqing Hu, Roy Ka-Wei Lee

TL;DR
LongGenBench is a new benchmark designed to evaluate large language models' ability to generate high-quality, long-form text that follows complex instructions, revealing current models' limitations in long text generation.
Contribution
The paper introduces LongGenBench, a comprehensive benchmark specifically targeting long-form text generation in LLMs, addressing a gap in existing evaluation methods.
Findings
Models perform poorly on long text generation tasks, and performance degrades further as the required output length grows.
Current LLMs struggle with maintaining coherence and adhering to instructions over extended sequences.
The benchmark reveals significant gaps in LLM capabilities for real-world long-form content creation.
Abstract
Current benchmarks like Needle-in-a-Haystack (NIAH), Ruler, and Needlebench focus on models' ability to understand long-context input sequences but fail to capture a critical dimension: the generation of high-quality long-form text. Applications such as design proposals, technical documentation, and creative writing rely on coherent, instruction-following outputs over extended sequences, a challenge that existing benchmarks do not adequately address. To fill this gap, we introduce LongGenBench, a novel benchmark designed to rigorously evaluate large language models' (LLMs) ability to generate long text while adhering to complex instructions. Through tasks requiring specific events or constraints within generated text, LongGenBench evaluates model performance across four distinct scenarios, three instruction types, and two generation lengths (16K and 32K tokens). Our evaluation of ten…
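To make the idea of "specific events or constraints within generated text" concrete, here is a minimal sketch of how adherence to such constraints could be scored. It is not the paper's evaluation pipeline: the `Week N:` entry format, the function names `split_entries` and `adherence`, and the use of plain substring matching are all assumptions made for illustration.

```python
import re


def split_entries(text):
    """Split generated text into entries keyed by week number.

    Assumes (hypothetically) that each entry starts with a 'Week N:' header
    at the beginning of a line.
    """
    parts = re.split(r"^Week (\d+):", text, flags=re.MULTILINE)
    # With one capture group, re.split returns
    # [prefix, number, body, number, body, ...].
    return {int(parts[i]): parts[i + 1] for i in range(1, len(parts) - 1, 2)}


def adherence(text, constraints):
    """Return the fraction of constraints satisfied by the generated text.

    `constraints` maps a week number to a phrase that must appear in that
    week's entry. Case-insensitive substring matching is a crude stand-in
    for a real judge of whether the required event was written.
    """
    if not constraints:
        return 0.0
    entries = split_entries(text)
    met = sum(
        1
        for week, phrase in constraints.items()
        if phrase.lower() in entries.get(week, "").lower()
    )
    return met / len(constraints)
```

Calling `adherence(generated_text, {1: "hiking", 3: "birthday party"})` would check that week 1 mentions hiking and week 3 mentions a birthday party. An entry that the model never wrote counts as a missed constraint, so a generation that stops early is penalized as well.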
Peer Reviews
Decision: ICLR 2025 Poster
- First benchmark focusing on long-form generation at test time
- The evaluation combines both the complexity of evaluation prompts and different scenarios
- First batch of results on 10 mainstream LLMs
- The paper is easy to follow
- I am a little bit distracted from the main takeaways of the experimental studies, and not so convinced by the failure cases. See question 1. I have other minor concerns regarding the experiment setup.
- There has been much research showing that the prompt format matters; what are your thoughts?
- Reasoning tasks are not well covered. As o1 seems to argue that a longer decoded length helps with complex reasoning tasks, in your benchmark you might want to add an axis of reasoning ability cle
* This paper is overall clear and easy to understand.
* The proposed evaluation is novel, and the generation ability it benchmarks is not covered by previous metrics.
* Some of the details are possibly missing or hard for readers to find; see "Questions".
* In the proposed benchmark, long content is formed by piling up short answers to many sub-queries, while the sub-tasks are, to a large extent, independent. For example, given all the demands on one year of diaries, it should be easy for an LLM to write a diary entry if it is assigned a specific day of the year, while this benchmark just requires the LLM to generate 365 diary entries all at once. In this case, the ch
1. Interesting task design, which can evaluate the long text generation ability of large models from a certain perspective.
2. The paper is well written.
1. The types of task scenarios are relatively limited, so it is impossible to comprehensively evaluate the long text generation capabilities of large models.
2. The evaluation metrics seem to be customized according to the scenario.
3. Limited number of models evaluated.
Taxonomy
Topics: Scientific Computing and Data Management
Methods: Focus
