Language Model Self-improvement by Reinforcement Learning Contemplation

Jing-Cheng Pang, Pengyuan Wang, Kaiyuan Li, Xiong-Hui Chen, Jiacheng Xu, Zongzhang Zhang, and Yang Yu

arXiv:2305.14483 · cs.CL · May 25, 2023 · 2 cites

TL;DR

This paper presents SIRLC, an unsupervised reinforcement learning method where LLMs self-assess and improve their performance across NLP tasks without external labels, enhancing accuracy and translation quality.

Contribution

Introduces SIRLC, a novel unsupervised reinforcement learning approach enabling LLMs to self-improve by evaluating and scoring their own generated outputs.

Findings

1. 5.6% increase in reasoning accuracy
2. BERTScore improved from 0.82 to 0.86 in translation
3. Effective across various NLP tasks

Abstract

Large Language Models (LLMs) have exhibited remarkable performance across various natural language processing (NLP) tasks. However, fine-tuning these models often necessitates substantial supervision, which can be expensive and time-consuming to obtain. This paper introduces a novel unsupervised method called Language Model Self-Improvement by Reinforcement Learning Contemplation (SIRLC) that improves LLMs without reliance on external labels. Our approach is grounded in the observation that it is simpler for language models to assess text quality than to generate text. Building on this insight, SIRLC assigns LLMs dual roles as both student and teacher. As a student, the LLM generates answers to unlabeled questions, while as a teacher, it evaluates the generated text and assigns scores accordingly. The model parameters are updated using reinforcement learning to maximize the evaluation…
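The student/teacher loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the "model" here is only a softmax policy over a fixed list of candidate answers, the update is plain REINFORCE with a moving-average baseline (chosen for brevity, not taken from the paper), and `teacher_score` is a hypothetical stand-in for prompting the same LLM to judge its own answer. All function names and parameters are assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def student_sample(logits, rng):
    """Student role: sample one candidate answer from the current policy."""
    probs = softmax(logits)
    idx = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    return idx, probs


def teacher_score(question, answer):
    """Teacher role (hypothetical stub): score an answer in [0, 1].

    ASSUMPTION: in the setting the abstract describes, the same LLM would be
    prompted to evaluate its own output. A fixed arithmetic check stands in
    for that self-evaluation so the sketch runs without a model.
    """
    left, right = question.split("+")
    return 1.0 if int(answer) == int(left) + int(right) else 0.0


def self_improve(question, candidates, steps=500, lr=0.5, seed=0):
    """Update the policy to maximize the teacher's score; no labels are used."""
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)
    baseline = 0.0  # moving average of rewards, used to reduce variance
    for _ in range(steps):
        idx, probs = student_sample(logits, rng)
        reward = teacher_score(question, candidates[idx])
        advantage = reward - baseline
        # REINFORCE for a softmax policy:
        # d log p(idx) / d logit_j = 1[j == idx] - p_j
        for j in range(len(logits)):
            grad = (1.0 if j == idx else 0.0) - probs[j]
            logits[j] += lr * advantage * grad
        baseline = 0.9 * baseline + 0.1 * reward
    return softmax(logits)


if __name__ == "__main__":
    # Probabilities over the candidates after self-improvement.
    print(self_improve("2 + 3", ["4", "5", "6"]))
```

The point of the sketch is the data flow: the only training signal is the score the evaluator assigns to sampled answers, so the policy shifts toward answers the evaluator rates highly. With a real LLM, both roles would be filled by the same model under different prompts.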

Peer Reviews

Decision: ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- The fact that evaluation is sometimes easier than generation is well known, but the authors show concrete experimental results that support this in non-trivial settings.
- Overall, the paper is well written and easy to follow.

Weaknesses

- The novelty of the work is limited. The overall approach is very similar to RLAIF.
- The effectiveness of the proposed approach is demonstrated using the 780M model of Flan-T5, but it is not clear how effective it is when other or larger models are used.

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 3

Strengths

Overall, this paper seems reasonably solid to this reviewer. The approach proposed is simple and general, yet it also seems novel and underexplored at least to this reviewer. The results suggest that it helps on reasoning tasks which could suggest something interesting is happening during the RL process. It's good to see results on a variety of different model sizes too, which suggest also that the gain doesn't go away with scale (at least for e.g. the BigBench task "Penguins in a Table") I hav

Weaknesses

The main concerns to this reviewer are:
* The datasets considered are a bit toy. It would be great to see other experiments from other domains (maybe something like math word problems?) or things considered by some of the other common papers in this space.
* The approach is limited to having a training dataset, though this is addressed appropriately in Appendix A.1 (at least to this reviewer, maybe extending this could be left for future work). However, I am a bit concerned that there might be s

Reviewer 03 · Rating 3 (reject, not good enough) · Confidence 3

Strengths

This work demonstrates empirical gains using self-improvement via evaluating self-generated outputs.

Weaknesses

1. The approach demonstrated in this work is not new, and the authors do not discuss prior work. There is a family of work on evaluating self-generation (https://arxiv.org/abs/2210.03629); how is this work different?
2. This work uses CoT; do the baselines use CoT as well? How much of the gain comes from CoT? This work is missing ablations that quantify the gains from the primary contribution.
3. The majority of this manuscript describes background information. The main contribution of

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications