Prompt-Based Length Controlled Generation with Reinforcement Learning
Renlong Jie, Xiaojun Meng, Lifeng Shang, Xin Jiang, Qun Liu

TL;DR
This paper introduces a prompt-based reinforcement learning approach to achieve accurate length-controlled text generation with large language models, improving efficiency and customization in NLP tasks.
Contribution
It presents a novel reinforcement learning method with reward models for precise length control in LLMs, along with a prompt extractor for rule-based inference.
Findings
Significantly improves length control accuracy in summarization tasks
Demonstrates strong generalization to unseen control prompts
Reduces inference cost by limiting generated length
Abstract
Large language models (LLMs) like ChatGPT and GPT-4 have attracted great attention given their surprising performance on a wide range of NLP tasks. Length-controlled generation with LLMs has emerged as an important topic, because it lets users fully leverage the capability of LLMs in more real-world scenarios, such as generating an answer or essay of a desired length. In addition, autoregressive generation in LLMs is extremely time-consuming, and the ability to control the generated length can reduce inference cost by limiting the length. Therefore, we propose a prompt-based length control method to achieve high-accuracy length-controlled generation. In particular, we adopt reinforcement learning with the reward signal given by either trainable or rule-based reward models, which further enhances the length-control ability of LLMs by rewarding outputs that follow pre-defined…
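The rule-based reward mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `count_words` and `length_reward`, the whitespace-based length measure, and the exact penalty form are assumptions made for illustration. Only the 'Equal' and 'Between' control types named in the reviews are covered. The reward is 0 when the constraint is met and becomes more negative as the output moves away from the allowed length, in contrast to a 0-1 error.

```python
from typing import Optional


def count_words(text: str) -> int:
    """Length measure (assumption): number of whitespace-separated words."""
    return len(text.split())


def length_reward(
    control_type: str,
    length: int,
    target: Optional[int] = None,
    lower: Optional[int] = None,
    upper: Optional[int] = None,
) -> float:
    """Hypothetical rule-based reward: 0.0 if the constraint holds,
    otherwise minus the distance to the nearest allowed length."""
    if control_type == "equal":
        if target is None:
            raise ValueError("'equal' needs a target length")
        return float(-abs(length - target))
    if control_type == "between":
        if lower is None or upper is None:
            raise ValueError("'between' needs lower and upper bounds")
        # Penalise only the part of the length that falls outside [lower, upper].
        too_short = max(lower - length, 0)
        too_long = max(length - upper, 0)
        return float(-(too_short + too_long))
    raise ValueError(f"unknown control type: {control_type}")
```

For example, `length_reward("between", 55, lower=40, upper=50)` returns `-5.0`, while any length from 40 to 50 returns `0.0`. In a reinforcement learning setup, a score of this kind would serve as the scalar reward for a sampled output, with `count_words` (or a tokenizer-based count) supplying `length`.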
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
- Rule-based rewards seem effective and more natural than the 0-1 error that is often used for constraints
- The standard prompt extractors (SPE) seem to achieve high accuracy, including on held-out prompt templates
- Adding RL + filter seems to improve the ability of models to adhere to length constraints
- Much of the paper is focused on using existing techniques, e.g. training models on the length of existing texts (this was used for length-controlled T5 infilling for instance) and PPO RL
- While the SPE accuracy seems high on held-out templates, all templates were written by the authors and are unlikely to cover the diversity of what humans might use in the wild. It would be useful to find a way to test generalizability to real user inputs. This is particularly important because the authors sp
- The proposed method is a simple and efficient way to control the output length of LMs.
- The paper clearly defines a set of standard control types with appropriate reward functions.
- The paper is well written and easy to follow.
- It appears that there is a significant improvement in the control settings of 'Equal' and 'Between' when considering the core setting between `Prompt` and `Prompt + RL`. However, it remains unclear whether the improvement persists when the method is integrated into larger LMs such as LLaMA. This limits the extent of their contributions, despite the potential practical applicability of the method due to its simplicity.
- The paper does not compare to existing methods, such as LenAtten and LAAM,
* This paper proposes a reasonable method to control an LM so that it generates responses satisfying a length condition. The method is simple and can be effective. Specifically, it mainly adopts PPO to optimize the LM with the authors’ designed rewards. The authors propose two variations of the reward function: (1) a standard prompt extractor (SPE) plus a rule-based reward function (Table 1); (2) a GPT2/BERT-based trained reward model. Both variations use the synthetic, designed standard control prompt
* While this paper puts emphasis on LLMs, the experiments use models with 124M, 355M and 774M parameters, and calling models of this size LLMs is debatable. The behavior of an LM can differ significantly between the million and billion parameter scales. Also, which GPT is used as the main model is not specified. Because the paper only mentions the word “GPT”, I will guess it is the GPT1 (Radford et al., 2018) or the GPT2 used for the SPE.
* Radford, Alec, et al. "Improving language understanding by gene
1) The paper shows improvement in length-control ability while maintaining the ROUGE scores.
2) The paper proposes and investigates different prompt extractors, and shows that a BERT-based model achieves perfect accuracy in both seen and unseen prompts.
3) The paper is the first (or one of the first) to apply the PPO algorithm to length-controllability in summarization.
4) The ablation study shows that a simple rule-based reward performs better than model-based rewards.
1) The main contributions of this paper are very incremental. For example,
   - 1.1) Controlling the length by input prompts has already been done by (Fan et al., 2018) and CTRLsum (He et al., 2022).
   - 1.2) Applying RL to controlling the length has already been done by (Bian et al., 2019)
   - 1.3) Sample filtering can be considered (I believe) as a weaker version of minimum risk decoding e.g., Freitag et al., 2022
   - 1.4) The relevant references CTRLsum (He et al., 2022) and (Bian e
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Layer Normalization · Dense Connections · Absolute Position Encodings · Residual Connection
