LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs

Yushi Bai; Jiajie Zhang; Xin Lv; Linzhi Zheng; Siqi Zhu; Lei Hou; Yuxiao Dong; Jie Tang; Juanzi Li

arXiv:2408.07055·cs.CL·August 14, 2024·2 cites

TL;DR

This paper introduces LongWriter, a method to extend the output length of long context LLMs beyond 10,000 words by creating specialized datasets and benchmarks, demonstrating that existing models have untapped potential for ultra-long generation.

Contribution

We propose AgentWrite and LongWriter-6k to significantly increase output lengths of LLMs, and develop LongBench-Write for evaluating ultra-long generation, achieving state-of-the-art results.

Findings

01 Models can generate over 10,000 words when trained on data with sufficiently long outputs.

02 AgentWrite decomposes ultra-long generation tasks into subtasks, enabling ultra-long output.

03 The trained models achieve state-of-the-art performance on the LongBench-Write ultra-long generation benchmark.

Abstract

Current long context large language models (LLMs) can process inputs up to 100,000 tokens, yet struggle to generate outputs exceeding even a modest length of 2,000 words. Through controlled experiments, we find that a model's effective generation length is inherently bounded by the samples it has seen during supervised fine-tuning (SFT). In other words, the output limitation is due to the scarcity of long-output examples in existing SFT datasets. To address this, we introduce AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks, enabling off-the-shelf LLMs to generate coherent outputs exceeding 20,000 words. Leveraging AgentWrite, we construct LongWriter-6k, a dataset containing 6,000 SFT examples with output lengths ranging from 2k to 32k words. By incorporating this dataset into model training, we successfully scale the output length of existing…
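The decomposition idea behind AgentWrite can be illustrated with a minimal sketch. This is not the authors' implementation: the plan-then-write structure, the prompt wording, and the names `LLM`, `plan_subtasks`, and `write_long` are all assumptions made for illustration. The model is passed in as a plain callable, so no particular API is implied.

```python
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to generated text.
LLM = Callable[[str], str]


def plan_subtasks(instruction: str, llm: LLM) -> List[str]:
    """Ask the model for an outline and return one subtask per non-empty line."""
    prompt = (
        "Break the following writing task into an ordered list of sections, "
        "one per line, each with its main point and a target word count.\n\n"
        f"Task: {instruction}"
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]


def write_long(instruction: str, llm: LLM) -> str:
    """Write each planned section in order, conditioning on the text so far."""
    sections: List[str] = []
    for step in plan_subtasks(instruction, llm):
        so_far = "\n\n".join(sections)
        prompt = (
            f"Task: {instruction}\n\n"
            f"Text written so far:\n{so_far}\n\n"
            f"Write only the next section: {step}"
        )
        sections.append(llm(prompt).strip())
    # The final output is the concatenation of all sections.
    return "\n\n".join(sections)
```

Because each call only has to produce one section, the total length is limited by the number of planned sections rather than by how much the model will emit in a single response. A real pipeline would also need to handle long contexts and check that sections meet their target lengths.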

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 6·Confidence 4

Strengths

- The focus of the paper is very clear. The paper presents very logical steps around the core idea of enabling LLMs to generate very long sequences. This makes the paper easy to understand (although I feel there might be too many closely related but different names such as LongWrite-Ruler, LongBench-Write, etc. that are a bit confusing).
- The empirical results of demonstrating current LLMs’ cap at around 2k output words provide interesting insights. And connecting it back to the training data

Weaknesses

- The novelty of the paper is somewhat limited. I like the idea of enabling the model to generate ultra-long sequences, but essentially it boils down to adjusting the training data length distribution to be better correlated with the testing scenario. Model behaviors are following what models are being trained on, so data augmentation or adjusting training data distribution is always a basic solution.
- The evaluation of ultra-long generation quality is less satisfactory, as it mainly relies on

Reviewer 02·Rating 6·Confidence 4

Strengths

1. The paper presents a systematic approach towards investigating the reason behind limitations around long generations in LLMs.
2. The proposed agentic pipeline is pretty intuitive and seems to solve the issue very effectively.
3. Extensive validation checks and comparisons of SOTA models against the finetuned models have been provided.
4. Good work with seeing the lift provided by DPO alignment and further ablation studies to strengthen the hypothesis.

Weaknesses

1. Need for Human Eval to assess AgentWrite quality - While the paper proposes AgentWrite for generating long-form content, the validation of output quality is primarily based on automatic metrics using GPT-4o as a judge. More rigorous human evaluation would strengthen the quality assessment.
2. Dependency on proprietary models - The AgentWrite pipeline relies on GPT-4o for generating training data, which makes the approach dependent on proprietary models and potentially difficult to reproduce.

Reviewer 03·Rating 6·Confidence 4

Strengths

1. The paper is written clearly, making it easy to understand.
2. This paper focuses on the very practical issue of model output length limitations and provides a systematic research approach.

Weaknesses

I believe there's room for improvement in the experimental aspect.
1. Is it possible to directly leverage AgentWrite to generate long responses with the target model, e.g., Llama-3.1-8B? Can it meet our requirements for output length and quality? Can the output be used to train the model? I think only having results from GPT-4o is not sufficient to demonstrate the effectiveness of AgentWrite.
2. As shown in Table 3, the performance of LongWriter-9B (w/ and w/o DPO) and LongWriter-8B is worse tha

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Mathematics, Computing, and Information Processing

Methods: Direct Preference Optimization · Shrink and Fine-Tune