Integrating Planning into Single-Turn Long-Form Text Generation
Yi Liang, You Wu, Honglei Zhuang, Li Chen, Jiaming Shen, Yiling Jia, Zhen Qin, Sumit Sanghai, Xuanhui Wang, Carl Yang, Michael Bendersky

TL;DR
This paper introduces a planning-based auxiliary task for large language models to improve the quality of long-form text generation, demonstrating significant enhancements in coherence and relevance across multiple datasets.
Contribution
It presents a novel single auxiliary task that enables LLMs to plan and structure content without multiple prompting rounds, utilizing synthetic data to improve long-form text generation.
Findings
+2.5% ROUGE-Lsum improvement
3.60 win/loss ratio in human evaluations
Enhanced organization, relevance, and verifiability
Abstract
Generating high-quality, in-depth textual documents, such as academic papers, news articles, Wikipedia entries, and books, remains a significant challenge for Large Language Models (LLMs). In this paper, we propose to use planning to generate long form content. To achieve our goal, we generate intermediate steps via an auxiliary task that teaches the LLM to plan, reason and structure before generating the final text. Our main novelty lies in a single auxiliary task that does not require multiple rounds of prompting or planning. To overcome the scarcity of training data for these intermediate steps, we leverage LLMs to generate synthetic intermediate writing data such as outlines, key information and summaries from existing full articles. Our experiments demonstrate on two datasets from different domains, namely the scientific news dataset SciNews and Wikipedia datasets in KILT-Wiki and…
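The data construction described in the abstract can be pictured with a minimal sketch. It assumes that each full article yields one plain writing example plus one auxiliary example whose target places the intermediate steps before the article. The field names, the bracketed section tags, the prompt wording, and the `stub_synthesize` helper are all hypothetical and used only for illustration; the paper's actual prompts and data format are not given in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical field names for the intermediate steps named in the abstract.
PLAN_FIELDS = ("outline", "key_information", "summary")


@dataclass
class TrainingExample:
    prompt: str
    target: str


def build_examples(
    task_prompt: str,
    article: str,
    synthesize: Callable[[str], Dict[str, str]],
) -> List[TrainingExample]:
    """Return the plain writing example plus one auxiliary planning example."""
    plan = synthesize(article)  # stands in for an LLM call (assumption)
    plan_text = "\n\n".join(
        f"[{name.upper()}]\n{plan[name]}" for name in PLAN_FIELDS
    )
    return [
        # Main task: prompt -> full article.
        TrainingExample(prompt=task_prompt, target=article),
        # Auxiliary task: prompt -> intermediate steps, then the article,
        # all in a single target so no second prompting round is needed.
        TrainingExample(
            prompt=task_prompt + "\nPlan before writing.",
            target=f"{plan_text}\n\n[ARTICLE]\n{article}",
        ),
    ]


def stub_synthesize(article: str) -> Dict[str, str]:
    """Toy replacement for the LLM that writes outlines and summaries."""
    first_sentence = article.split(".")[0].strip() + "."
    return {
        "outline": "1. Introduction\n2. Details",
        "key_information": first_sentence,
        "summary": first_sentence,
    }


examples = build_examples(
    "Write a news article about the study.",
    "Researchers tested a new method. It improved results.",
    stub_synthesize,
)
```

In this sketch the plan and the article share one target string, which is one way to read "a single auxiliary task that does not require multiple rounds of prompting"; a real pipeline would replace `stub_synthesize` with calls to an LLM.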
Peer Reviews
Decision·Submitted to ICLR 2025
The authors have validated the effectiveness of their method through both automated metrics and human evaluation, showcasing the method's practical utility.
1. There is significant room for improvement in the writing style and structure of the paper. For example, Sections 4.1, 4.2, and 5.1 are overly verbose and could benefit from a more concise presentation, with implementation details potentially moved to the appendix. The organization of the paper is somewhat disjointed, failing to highlight key points effectively. For instance, the second task introduced in section 4.3 is mentioned as ineffective in the experiments (Table 2); thus, its inclusion
- Very well-written paper! The proposed method is explained clearly, and its difference from other objectives (generating in multiple turns) is distinguished. - The proposed method’s intermediate steps are all in scope, and all intuitively should help with full document generation. They also try to mitigate hallucinations by using a score function of coherence and completeness to choose the highest overall quality. As confirmed by both automatic metrics as well as human/LLM as judge metric, t
- For evaluation metrics, although human and auto SxS are used, only 50 articles are rated by humans. Furthermore, the LLM that is used to generate the synthetic data is also used as the rater. As suggested by some previous work (Panickssery et al. 2024), for example, LLM raters might be able to recognize and favor their own generations. Therefore, there is a chance that the model trained on additional Gemini Ultra generated data is favored over zero-shot. - Length impact: it seems
Pros: * The paper is well-written and easy to follow. * The construction of the training data is automatic, leveraging off-the-shelf LLMs to generate synthetic intermediate writing data based on the full articles, which saves the expensive cost of manual annotation. * With the auxiliary task at training time, the model can learn to generate more coherent and structured long-form text in a single pass.
Cons: * The paper lacks a comparison with previous work like ProGen (Tan et al., 2021), Ex$^3$ (Huang et al., 2024), etc. The involved baseline training and prompt setups are too simple to demonstrate the superiority of the proposed approach. * Besides the coherence and structure, the approach in this paper appears to be more efficient and faster than the previous work. The paper lacks metrics to measure the efficiency and speed of the proposed approach compared to previous work. * The offline
Taxonomy
Topics: Model-Driven Software Engineering Techniques · Natural Language Processing Techniques · Software Engineering Research
