Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization

Rajkumar Ramamurthy; Prithviraj Ammanabrolu; Kianté Brantley; Jack Hessel; Rafet Sifa; Christian Bauckhage; Hannaneh Hajishirzi; Yejin Choi

arXiv:2210.01241·cs.CL·December 5, 2023·54 cites

TL;DR

This paper introduces the RL4LMs library, a benchmark, and the NLPO algorithm to evaluate and improve reinforcement learning methods for aligning large language models with human preferences, addressing the empirical challenges of applying RL to NLP.

Contribution

It provides an open-source library, a new benchmark, and a novel RL algorithm, advancing practical RL applications for NLP model alignment.

Findings

01

RL techniques outperform supervised methods in aligning LMs to human preferences.

02

NLPO shows greater stability and performance than previous policy gradient methods.

03

The benchmark enables standardized evaluation of RL algorithms for NLP tasks.

Abstract

We tackle the problem of aligning pre-trained large language models (LMs) with human preferences. If we view text generation as a sequential decision-making problem, reinforcement learning (RL) appears to be a natural conceptual framework. However, using RL for LM-based generation faces empirical challenges, including training instability due to the combinatorial action space, as well as a lack of open-source libraries and benchmarks customized for LM alignment. Thus, a question rises in the research community: is RL a practical paradigm for NLP? To help answer this, we first introduce an open-source modular library, RL4LMs (Reinforcement Learning for Language Models), for optimizing language generators with RL. The library consists of on-policy RL algorithms that can be used to train any encoder or encoder-decoder LM in the HuggingFace library (Wolf et al. 2020) with an arbitrary…
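The view of text generation as sequential decision-making can be illustrated with a minimal sketch. It is a toy under stated assumptions and does not use the RL4LMs API: the names `rollout`, `make_uniform_policy`, `toy_reward`, the `EOS` token, and the four-word vocabulary are all hypothetical. The state is the prompt plus the tokens generated so far, each action is the choice of one token from the vocabulary, and an arbitrary reward function scores the finished sequence.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical type aliases for this sketch (not part of any library).
Policy = Callable[[Sequence[str], Sequence[str]], str]
RewardFn = Callable[[Sequence[str], Sequence[str]], float]

EOS = "<eos>"  # hypothetical end-of-sequence token


def rollout(
    prompt: Sequence[str],
    vocab: Sequence[str],
    policy: Policy,
    reward_fn: RewardFn,
    max_len: int = 8,
) -> Tuple[List[str], float]:
    """Generate one episode: one action (token) per step, reward at the end."""
    state = list(prompt)       # state = prompt + tokens generated so far
    actions: List[str] = []
    for _ in range(max_len):
        token = policy(state, vocab)   # action = next token from the vocabulary
        actions.append(token)
        state.append(token)            # deterministic transition: append the token
        if token == EOS:
            break
    return actions, reward_fn(prompt, actions)


def make_uniform_policy(seed: int) -> Policy:
    """A stand-in policy that samples tokens uniformly (a real LM would go here)."""
    rng = random.Random(seed)

    def policy(state: Sequence[str], vocab: Sequence[str]) -> str:
        return rng.choice(list(vocab))

    return policy


def toy_reward(prompt: Sequence[str], actions: Sequence[str]) -> float:
    """An arbitrary sequence-level reward: count occurrences of the token 'good'."""
    return float(sum(tok == "good" for tok in actions))


VOCAB = ["good", "bad", "fine", EOS]
actions, reward = rollout(["the", "movie", "was"], VOCAB, make_uniform_policy(0), toy_reward)
print(actions, reward)
```

With a real LM the vocabulary has tens of thousands of entries, so the number of possible sequences grows exponentially with length; this is the combinatorial action space the abstract names as a source of training instability.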

Code & Models

Videos

Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization (SlidesLive)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Lib · Entropy Regularization · Proximal Policy Optimization