Improving the Language Understanding Capabilities of Large Language Models Using Reinforcement Learning

Bokai Hu; Sai Ashish Somayajula; Xin Pan; Pengtao Xie

arXiv:2410.11020·cs.CL·September 29, 2025

Open Access · 5 Reviews

TL;DR

This paper demonstrates that applying reinforcement learning, specifically PPO, to large language models significantly enhances their natural language understanding abilities, outperforming traditional fine-tuning and prompting methods.

Contribution

It introduces a reinforcement learning framework for LLMs, showing substantial improvements in NLU tasks over supervised fine-tuning and prompting techniques.

Findings

1. PPO improves GLUE scores by 6.3 points on average over supervised fine-tuning.

2. PPO outperforms zero-shot and few-shot prompting by 38.7 and 26.1 points, respectively.

3. PPO-tuned models surpass GPT-4o on multiple NLU benchmarks.

Abstract

Instruction-fine-tuned large language models (LLMs) under 14B parameters continue to underperform on natural language understanding (NLU) tasks, often trailing smaller models like BERT-base on benchmarks such as GLUE and SuperGLUE. Motivated by the success of reinforcement learning in reasoning tasks (e.g., DeepSeek), we explore Proximal Policy Optimization (PPO) as a framework to improve the NLU capabilities of LLMs. We frame NLU as a reinforcement learning environment, treating token generation as a sequence of actions and optimizing for reward signals based on alignment with ground-truth labels. PPO consistently outperforms supervised fine-tuning, yielding an average improvement of 6.3 points on GLUE, and surpasses zero-shot and few-shot prompting by 38.7 and 26.1 points, respectively. Notably, PPO-tuned models outperform GPT-4o by over 4% on average across sentiment and natural…
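The framing in the abstract (generated tokens as actions, reward from agreement with the ground-truth label) can be illustrated with a minimal sketch. The paper's actual reward function, prompt format, and hyperparameters are not given in this excerpt, so everything below is an assumption made for illustration: the label set `LABELS`, the helper names `extract_label` and `label_reward`, the binary 1.0/0.0 reward, and the clipping value `clip_eps=0.2`. The last function is the standard PPO clipped surrogate objective, written in plain Python rather than taken from the authors' code.

```python
import math
import re

# Hypothetical label set for a binary sentiment task (an assumption for illustration).
LABELS = ("positive", "negative")


def extract_label(generated_text, labels=LABELS):
    """Return the known label that appears earliest in the generated text, or None."""
    text = generated_text.lower()
    hits = []
    for label in labels:
        match = re.search(r"\b" + re.escape(label) + r"\b", text)
        if match:
            hits.append((match.start(), label))
    return min(hits)[1] if hits else None


def label_reward(generated_text, gold_label, labels=LABELS):
    """Assumed reward: 1.0 if the extracted label equals the gold label, else 0.0."""
    return 1.0 if extract_label(generated_text, labels) == gold_label else 0.0


def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate over a sequence of token-level actions.

    logp_new / logp_old are per-token log-probabilities under the current and
    the behaviour policy; advantages are per-token advantage estimates.
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the unclipped and clipped terms.
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)
```

In this sketch the reward is computed once per generated answer; how it is distributed over tokens and turned into advantages (for example, with a value head) is left out, since the excerpt does not describe it.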

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 4

Strengths

The paper is very well written and very easy to follow.

Weaknesses

My primary concern is the limited contribution of this paper. Personally, I didn’t find anything new or insightful in it. The conclusions are rather obvious. The authors highlight the improvement of PPO fine-tuned LLMs with sufficient training data compared to zero-shot and few-shot LLMs, but this outcome is quite predictable.

Reviewer 02 · Rating 3 · Confidence 4

Strengths

Research on the natural language understanding capabilities of LLMs is an important part of AI. The paper reports experimental results for the combination of two common methods, LoRA and PPO, on the GLUE and SuperGLUE benchmarks.

Weaknesses

LoRA and PPO are commonly used in LLM post-training. For example, trlX [1] is one of the popular LLM post-training tools and implements both PPO and LoRA. This paper does not present novel methods or findings.

[1] trlX: A Framework for Large Scale Reinforcement Learning from Human Feedback

Reviewer 03 · Rating 3 · Confidence 4

Strengths

- The authors conducted comprehensive experiments on both the GLUE and SuperGLUE benchmarks, providing a thorough evaluation of their approach.
- The study includes experiments across various LLMs, including LLAMA2, Qwen2.5, and MPT, demonstrating the robustness and generalizability of the proposed method.

Weaknesses

- For the single-task (ST) setting, the motivation for training a 7B parameter LLM to address NLU tasks remains unclear. By checking the GLUE and SuperGLUE leaderboards, it is evident that smaller bidirectional transformers with fewer than 1.5B parameters already achieve high performance (90+ overall scores) on NLU/classification tasks. This raises the question of why a significantly larger 7B LLM is needed to tackle NLU tasks using autoregressive decoding.
- For the multi-task (MT) setting, usi

Reviewer 04 · Rating 3 · Confidence 4

Strengths

This paper has several strengths worth noting:
1. The writing is clear, and the logical flow is easy to follow.
2. The experimental design is comprehensive, taking into account multiple factors that can influence performance, such as various datasets, backbones, and baselines.
3. I appreciate the attention to detail in this article, especially the thorough introduction to the prerequisites for PPO and the detailed design of the prompts, etc.

Weaknesses

There are several weaknesses that need to be addressed:
1. This paper focuses on an interesting problem: decoder-only models perform worse than smaller encoder-only models. However, I question whether this issue still holds true with the emergence of more powerful language models, such as GPT-4o. It would be beneficial to test the most recent state-of-the-art models to evaluate the validity of this problem.
2. The datasets utilized in this paper (GLUE and SuperGLUE) are overly simplistic. I re

Reviewer 05 · Rating 3 · Confidence 4

Strengths

1. The authors construct a complete pipeline to improve the performance of LLMs on NLU tasks.
2. These improvements are consistent across different LLMs, highlighting PPO’s robustness and effectiveness in enhancing the NLU capabilities of LLMs.

Weaknesses

1. Lack of novelty: This work simply employs LoRA + PPO on NLU tasks, without introducing new techniques or insights compared to previous works such as [1].
2. Insufficient baselines: though claiming that "LLMs with prompting methods fall short on natural language understanding (NLU) tasks", the most powerful proprietary LLMs such as GPT-4o are not evaluated in the experiments. Besides, popular prompting methods such as few-shot CoT are also not evaluated. In addition, for SuperGLUE, Qwen-2.5-7

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems

Methods: Entropy Regularization · Proximal Policy Optimization · Shrink and Fine-Tune