Language Model Self-improvement by Reinforcement Learning Contemplation
Jing-Cheng Pang, Pengyuan Wang, Kaiyuan Li, Xiong-Hui Chen, Jiacheng Xu, Zongzhang Zhang, and Yang Yu

TL;DR
This paper presents SIRLC, an unsupervised reinforcement learning method where LLMs self-assess and improve their performance across NLP tasks without external labels, enhancing accuracy and translation quality.
Contribution
Introduces SIRLC, a novel unsupervised reinforcement learning approach enabling LLMs to self-improve by evaluating and scoring their own generated outputs.
Findings
5.6% increase in reasoning accuracy
BERTScore improved from 0.82 to 0.86 in translation
Effective across various NLP tasks
Abstract
Large Language Models (LLMs) have exhibited remarkable performance across various natural language processing (NLP) tasks. However, fine-tuning these models often necessitates substantial supervision, which can be expensive and time-consuming to obtain. This paper introduces a novel unsupervised method called Language Model Self-Improvement by Reinforcement Learning Contemplation (SIRLC) that improves LLMs without reliance on external labels. Our approach is grounded in the observation that it is simpler for language models to assess text quality than to generate text. Building on this insight, SIRLC assigns LLMs dual roles as both student and teacher. As a student, the LLM generates answers to unlabeled questions, while as a teacher, it evaluates the generated text and assigns scores accordingly. The model parameters are updated using reinforcement learning to maximize the evaluation…
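The student/teacher loop described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the "student" is a softmax policy over a fixed list of candidate answers rather than an LLM, the "teacher" (`self_evaluate`) is a hard-coded check standing in for the model's own judgement, and plain REINFORCE with a running baseline is swapped in because the abstract only says "reinforcement learning". The names `softmax`, `self_evaluate`, and `train` are hypothetical and do not come from the paper.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def self_evaluate(target, answer):
    """Toy teacher role (assumption): score an answer in {0.0, 1.0}.

    Checking whether answer * answer == target is easier than producing
    the square root, mirroring the idea that evaluation is simpler than
    generation. No reference answer is stored anywhere.
    """
    return 1.0 if answer * answer == target else 0.0


def train(target, candidates, steps=500, lr=0.5, seed=0):
    """Toy student role: sample answers and reinforce highly scored ones."""
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)  # policy parameters
    baseline = 0.0                    # running mean of rewards
    for _ in range(steps):
        probs = softmax(logits)
        # Student: sample one candidate answer from the current policy.
        idx = rng.choices(range(len(candidates)), weights=probs)[0]
        # Teacher: score the sampled answer.
        reward = self_evaluate(target, candidates[idx])
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        # REINFORCE: d log pi(idx) / d logit_j = 1[j == idx] - probs[j].
        for j in range(len(logits)):
            grad = (1.0 if j == idx else 0.0) - probs[j]
            logits[j] += lr * advantage * grad
    return softmax(logits)


candidates = [5, 6, 7, 8]
final_probs = train(target=49, candidates=candidates)
best_answer = candidates[final_probs.index(max(final_probs))]
```

In this sketch the policy's probability mass shifts toward the answer that the evaluator scores highly, without any labeled answer being supplied. With an actual LLM, both roles would be played by the same model through different prompts, and the update would act on the model's parameters.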
Peer Reviews
Decision: ICLR 2024 poster
- The fact that evaluation is sometimes easier than generation is well known, but the authors show concrete experimental results that support this in non-trivial settings.
- Overall, the paper is well written and easy to follow.
- The novelty of the work is limited. The overall approach is very similar to RLAIF.
- The effectiveness of the proposed approach is demonstrated using the 780M model of Flan-T5, but it is not clear how effective it is when other or larger models are used.
Overall, this paper seems reasonably solid to this reviewer. The approach proposed is simple and general, yet it also seems novel and underexplored, at least to this reviewer. The results suggest that it helps on reasoning tasks, which could suggest something interesting is happening during the RL process. It's good to see results on a variety of different model sizes too, which also suggest that the gain doesn't go away with scale (at least for, e.g., the BigBench task "Penguins in a Table"). I hav
The main concerns to this reviewer are:
* The datasets considered are a bit toy. It would be great to see other experiments from other domains (maybe something like math word problems?) or things considered by some of the other common papers in this space.
* The approach is limited to having a training dataset, though this is addressed appropriately in Appendix A.1 (at least to this reviewer, maybe extending this could be left for future work). However, I am a bit concerned that there might be s
This work demonstrates empirical gains using self-improvement via evaluating self-generated outputs.
1. The approaches demonstrated in this work are not new, and the authors do not discuss prior work. There is a family of work on evaluating self-generation (https://arxiv.org/abs/2210.03629); how is this work different?
2. This work uses CoT; do the baselines use CoT as well? How much of the gain is coming from CoT? This work is missing ablations that quantify the gains from the primary contribution.
3. The majority of this manuscript describes background information. The main contribution of
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
