Language Models can Self-Lengthen to Generate Long Texts

Shanghaoran Quan; Tianyi Tang; Bowen Yu; An Yang; Dayiheng Liu; Bofei Gao; Jianhong Tu; Yichang Zhang; Jingren Zhou; Junyang Lin

arXiv:2410.23933 · cs.CL · November 1, 2024

Open Access · 1 Repo · 1 Dataset · 4 Reviews

TL;DR

This paper introduces Self-Lengthen, an iterative training framework enabling large language models to generate longer, more aligned texts without auxiliary data, outperforming existing methods on benchmarks and human evaluations.

Contribution

The paper presents a novel Self-Lengthen framework that leverages intrinsic model capabilities to improve long-text generation without auxiliary data or proprietary models.

Findings

01. Outperforms existing methods on benchmarks.
02. Effective in generating longer, aligned texts.
03. Applicable to open-source LLMs like Qwen2 and LLaMA3.

Abstract

Recent advancements in Large Language Models (LLMs) have significantly enhanced their ability to process long contexts, yet a notable gap remains in generating long, aligned outputs. This limitation stems from a training gap where pre-training lacks effective instructions for long-text generation, and post-training data primarily consists of short query-response pairs. Current approaches, such as instruction backtranslation and behavior imitation, face challenges including data quality, copyright issues, and constraints on proprietary model usage. In this paper, we introduce an innovative iterative training framework called Self-Lengthen that leverages only the intrinsic knowledge and skills of LLMs without the need for auxiliary data or proprietary models. The framework consists of two roles: the Generator and the Extender. The Generator produces the initial response, which is then…
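The two roles named in the abstract can be pictured as a small data-collection loop. The sketch below is only an illustration under stated assumptions: the abstract is cut off before it says what happens after the Generator step, so the `extend` call, the word-count comparison, and the names `self_lengthen_round`, `toy_generate`, and `toy_extend` are all hypothetical and not taken from the paper or its repository.

```python
from typing import Callable, List, Tuple


def self_lengthen_round(
    instructions: List[str],
    generate: Callable[[str], str],
    extend: Callable[[str, str], str],
) -> List[Tuple[str, str]]:
    """Collect (instruction, longer_response) pairs for one round.

    Assumption: the Extender takes the instruction plus the Generator's
    response and returns a longer response. The source does not confirm
    this exact interface.
    """
    pairs = []
    for instruction in instructions:
        initial = generate(instruction)        # Generator role
        longer = extend(instruction, initial)  # Extender role (assumed)
        # Keep only responses that actually grew; word count is a crude proxy.
        if len(longer.split()) > len(initial.split()):
            pairs.append((instruction, longer))
    return pairs


# Toy stand-ins for the two roles; a real run would call an LLM here.
def toy_generate(instruction: str) -> str:
    return "A short draft about " + instruction + "."


def toy_extend(instruction: str, response: str) -> str:
    return response + " It continues with more detail on " + instruction + "."


pairs = self_lengthen_round(["rivers", "bridges"], toy_generate, toy_extend)
```

In an iterative setup, the collected pairs would presumably serve as training data for the next round, but that step is left out here because the visible text does not describe it.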

Peer Reviews

Decision · ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. Self-Lengthen is cost-effective and easy to use. It only requires a set of seed instructions for long-text output tasks and an open-source instruction model to automatically enhance the model's ability to generate long-text outputs.
2. This paper proposes a two-stage extension method that ensures the extension does not end normally. This creates space for the model to seamlessly connect preceding and succeeding segments, thereby enhancing its ability to complete extension tasks.
3. The auth

Weaknesses

1. The paper adopts a rule-based approach to filter out invalid responses to ensure their quality. How is 'frequent repetition' specifically determined? Is it based on rules or scored by a more advanced LLM? If the extended responses merely describe the same meaning in different styles, is this kind of extension meaningful?
2. During the process of instruction evolution, some instruction data are inherently unsuitable for this type of evolution. For example, if the response to an instruction is

Reviewer 02 · Rating 3 · Confidence 4

Strengths

- This paper proposes a new method to improve language models’ long-form generation performance.
- This topic is relevant to a wide range of applications which are bottlenecked by the response length that language models can reliably output.

Weaknesses

- The proposed method contains a seemingly arbitrary decision of truncating the response to ½ or ⅔ for further extension. It is unclear why these cutoffs were chosen and how they compare to other cutoffs.
- The proposed method utilized surface form heuristics (e.g. length, repetition) to ensure the quality of extended responses, while the semantic content is not quality assured. It is unclear if training on synthetic self-generated data hurts other LM capabilities, e.g., math/code reasoning and
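To make the reviewer's two points concrete, here is a minimal sketch of prefix truncation and a surface-form filter based on length and repetition. It is not the authors' implementation: the helper names, the repeated-trigram measure, and the thresholds `min_growth=1.2` and `max_repeat=0.3` are assumptions chosen for illustration, and the page does not say how the paper defines "frequent repetition".

```python
from collections import Counter


def truncate_prefix(text: str, fraction: float) -> str:
    """Keep roughly the first `fraction` of the words (e.g. 0.5)."""
    words = text.split()
    keep = max(1, int(len(words) * fraction))
    return " ".join(words[:keep])


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Share of word n-grams whose n-gram occurs more than once."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


def passes_surface_filter(
    original: str,
    extended: str,
    min_growth: float = 1.2,   # hypothetical threshold
    max_repeat: float = 0.3,   # hypothetical threshold
) -> bool:
    """Accept an extension only if it grew enough and is not too repetitive."""
    grew = len(extended.split()) >= min_growth * len(original.split())
    return grew and repeated_ngram_ratio(extended) <= max_repeat
```

A filter like this checks form only. Two extensions that restate the same idea in different words would both pass, which is the gap the reviewer points at.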

Reviewer 03 · Rating 3 · Confidence 4

Strengths

1. Compared to the previous method, Self-Lengthen has no need for auxiliary data or powerful proprietary models, and supports outputs with more diverse styles and types.
2. Experiments on benchmarks and human evaluations show that Self-Lengthen outperforms existing methods in long-text generation when applied to top open-source LLMs such as Qwen2 and LLaMA3.

Weaknesses

1. The design of LonGen benchmark is too similar to the benchmark in LongWriter (i.e., LongBench-Write), and many tables (e.g., Table 2, 3) and figures (e.g., Fig 5, 6) are similar to those in LongWriter without proper citations. The authors should give a more detailed explanation and comparison.
2. There are many missing details in the experiments, including the calculation method of distinct scores, the training data statistics, and the supported maximum output length.
3. The length control
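Since the reviewer notes that the calculation of distinct scores is not specified, the sketch below shows one widely used definition, distinct-n: the number of unique word n-grams divided by the total number of n-grams across a set of texts. This is an assumption about what such a score could look like. The paper may tokenize differently or aggregate per response, and the function name `distinct_n` is illustrative.

```python
from typing import Iterable


def distinct_n(texts: Iterable[str], n: int = 2) -> float:
    """Unique word n-grams divided by total n-grams over all texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()  # whitespace tokenization (assumed)
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0
```

Because the denominator grows with text length, corpus-level distinct-n tends to fall as outputs get longer, so scores for responses of very different lengths are hard to compare without knowing the exact computation.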

Reviewer 04 · Rating 3 · Confidence 5

Strengths

1. The proposed Self-Lengthen framework introduces a unique iterative approach to improve long-text generation by utilizing the intrinsic capabilities of LLMs without relying on additional external datasets or proprietary models.
2. The method is simple yet practical, focusing on leveraging existing models' capabilities through iterative extension. This makes the method easy to implement and potentially scalable to various domains where long-text generation is required.
3. The authors conduct

Weaknesses

1. **Motivation**: The motivation for this work is not sufficiently compelling. Regarding the instruction backtranslation method, SlimPajama already provides a large amount of long-text data generated by real-world [1]. For the behavior imitation approach, it is difficult to agree that there is a significant difference between using GPT-4 and open-source models. LongWriter's AgentWrite method can also use open-source models to generate data, which undermines the claimed uniqueness of Self-Len

Code & Models

Repositories

QwenLM/Self-Lengthen
Official

Datasets

quanshr/LonGen
dataset · 15 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques