LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs
Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li

TL;DR
This paper introduces LongWriter, a method to extend the output length of long context LLMs beyond 10,000 words by creating specialized datasets and benchmarks, demonstrating that existing models have untapped potential for ultra-long generation.
Contribution
We propose AgentWrite and LongWriter-6k to significantly increase output lengths of LLMs, and develop LongBench-Write for evaluating ultra-long generation, achieving state-of-the-art results.
Findings
Models can generate over 10,000 words with proper training data.
AgentWrite decomposes tasks to enable ultra-long output generation.
The trained models achieve state-of-the-art performance on the ultra-long generation benchmark, LongBench-Write.
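The first finding ties achievable output length to the training data, so a natural sanity check is to look at how response lengths are distributed in an SFT dataset. The snippet below is a minimal sketch of such a check, not the authors' tooling: the function names, the bucket edges, and the use of whitespace splitting as a word count are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Sequence


def word_count(text: str) -> int:
    """Count whitespace-separated words (a rough proxy, assumed here)."""
    return len(text.split())


def length_histogram(
    outputs: Iterable[str],
    edges: Sequence[int] = (500, 2000, 4000, 10000),  # hypothetical bucket edges
) -> Dict[str, int]:
    """Bucket SFT responses by word count to expose missing long outputs."""
    # Build labels such as "0-500", "500-2000", ..., "10000+".
    labels = []
    lo = 0
    for hi in edges:
        labels.append(f"{lo}-{hi}")
        lo = hi
    labels.append(f"{lo}+")

    hist = {label: 0 for label in labels}
    for text in outputs:
        n = word_count(text)
        for i, hi in enumerate(edges):
            if n < hi:
                hist[labels[i]] += 1
                break
        else:
            # Longer than every edge: falls into the open-ended last bucket.
            hist[labels[-1]] += 1
    return hist
```

If the buckets above 2,000 words come back nearly empty, that would be consistent with the scarcity of long-output examples the abstract describes.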
Abstract
Current long context large language models (LLMs) can process inputs up to 100,000 tokens, yet struggle to generate outputs exceeding even a modest length of 2,000 words. Through controlled experiments, we find that the model's effective generation length is inherently bounded by the samples it has seen during supervised fine-tuning (SFT). In other words, the output limitation is due to the scarcity of long-output examples in existing SFT datasets. To address this, we introduce AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks, enabling off-the-shelf LLMs to generate coherent outputs exceeding 20,000 words. Leveraging AgentWrite, we construct LongWriter-6k, a dataset containing 6,000 SFT examples with output lengths ranging from 2k to 32k words. By incorporating this dataset into model training, we successfully scale the output length of existing…
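The abstract describes AgentWrite only as a pipeline that decomposes an ultra-long generation task into subtasks. One plausible reading is a plan-then-write loop: ask the model for an outline with per-section word targets, then write the sections one at a time with the text so far as context. The sketch below illustrates that reading under stated assumptions; the `LLM` callable, the prompt wording, and the `point | word count` plan format are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: any function that maps a prompt to generated text.
LLM = Callable[[str], str]


def make_plan(llm: LLM, instruction: str) -> List[Tuple[str, int]]:
    """Ask for an outline, one '<main point> | <word count>' per line."""
    prompt = (
        "Break the writing task below into sections. "
        "Output one line per section as: <main point> | <word count>\n\n"
        f"Task: {instruction}"
    )
    plan = []
    for line in llm(prompt).splitlines():
        if "|" not in line:
            continue  # skip anything that is not a plan line
        point, _, count = line.rpartition("|")
        digits = "".join(ch for ch in count if ch.isdigit())
        if point.strip() and digits:
            plan.append((point.strip(), int(digits)))
    return plan


def write_sections(llm: LLM, instruction: str, plan: List[Tuple[str, int]]) -> str:
    """Write each planned section in order, conditioning on earlier sections."""
    written: List[str] = []
    for point, target in plan:
        so_far = "\n\n".join(written)
        prompt = (
            f"Task: {instruction}\n\n"
            f"Text written so far:\n{so_far}\n\n"
            f"Write only the next section: {point} (about {target} words)."
        )
        written.append(llm(prompt).strip())
    return "\n\n".join(written)


def agent_write(llm: LLM, instruction: str) -> str:
    """Plan first, then write section by section and concatenate."""
    return write_sections(llm, instruction, make_plan(llm, instruction))
```

Because each call only has to produce one section, no single generation needs to exceed the length the underlying model is comfortable with; the total length comes from concatenation.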
Peer Reviews
Decision · ICLR 2025 Poster
- The focus of the paper is very clear. The paper presents very logical steps around the core idea of enabling LLMs to generate very long sequences. This makes the paper easy to understand (although I feel there might be too many closely related but different names such as LongWrite-Ruler, LongBench-Write, etc. that are a bit confusing).
- The empirical results of demonstrating current LLMs’ cap at around 2k output words provide interesting insights. And connecting it back to the training data
- The novelty of the paper is somewhat limited. I like the idea of enabling the model to generate ultra-long sequences, but essentially it boils down to adjusting the training data length distribution to be better correlated with the testing scenario. Model behaviors are following what models are being trained on, so data augmentation or adjusting training data distribution is always a basic solution.
- The evaluation of ultra-long generation quality is less satisfactory, as it mainly relies on
1. The paper presents a systematic approach towards investigating the reason behind limitations around long generations in LLMs.
2. The proposed agentic pipeline is pretty intuitive and seems to solve the issue very effectively.
3. Extensive validation checks and comparison of SOTA models against the finetuned models has been provided.
4. Good work with seeing the lift provided by DPO alignment and further ablation studies to strengthen the hypothesis.
1. Need for Human Eval to assess AgentWrite quality - While the paper proposes AgentWrite for generating long-form content, the validation of output quality is primarily based on automatic metrics using GPT-4o as a judge. More rigorous human evaluation would strengthen the quality assessment.
2. Dependency on proprietary models - The AgentWrite pipeline relies on GPT-4o for generating training data, which makes the approach dependent on proprietary models and potentially difficult to reproduce.
1. The paper is written clearly, making it easy to understand.
2. This paper focuses on the very practical issue of model output length limitations and provides a systematic research approach.
I believe there's room for improvement in the experimental aspect.
1. Is it possible to directly leverage AgentWrite to generate long responses with the target model, e.g., Llama-3.1-8B? Can it meet our requirements for output length and quality? Can the output be used to train the model? I think only having results from GPT-4o is not sufficient to demonstrate the effectiveness of AgentWrite.
2. As shown in Table 3, the performance of LongWriter-9B (w/ and w/o DPO) and LongWriter-8B is worse tha
Code & Models
- 🤗 zai-org/LongWriter-llama3.1-8b · model · 198 dl · ♡ 65
- 🤗 zai-org/LongWriter-glm4-9b · model · 148 dl · ♡ 130
- 🤗 KnutJaegersberg/LongWriter-8b-exl-8.0bpw · model · 3 dl · ♡ 1
- 🤗 QuantFactory/LongWriter-llama3.1-8b-GGUF · model · 161 dl · ♡ 8
- 🤗 QuantFactory/LongWriter-glm4-9b-GGUF · model · 259 dl · ♡ 10
- 🤗 RichardErkhov/THUDM_-_LongWriter-llama3.1-8b-gguf · model · 89 dl · ♡ 1
- 🤗 LoneStriker/LongWriter-llama3.1-8b-3.0bpw-h6-exl2 · model · 2 dl
- 🤗 LoneStriker/LongWriter-llama3.1-8b-4.0bpw-h6-exl2 · model
- 🤗 LoneStriker/LongWriter-llama3.1-8b-5.0bpw-h6-exl2 · model · 1 dl
- 🤗 LoneStriker/LongWriter-llama3.1-8b-6.0bpw-h6-exl2 · model
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Mathematics, Computing, and Information Processing
Methods: Direct Preference Optimization · Shrink and Fine-Tune
