Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning

Jean Vassoyan; Nathanaël Beau; Roman Plaud

arXiv:2502.06533·cs.CL·February 11, 2025

TL;DR

This paper proposes a modification to the KL penalty in RL fine-tuning of language models, focusing on critical tokens to improve exploration and enhance long-term goal achievement.

Contribution

It introduces a simple KL penalty adjustment that emphasizes exploration on critical tokens, boosting RL fine-tuning effectiveness.

Findings

01. The degree of pre-training influences exploration dynamics.

02. Critical tokens have a dramatic impact on the final outcome.

03. The modified KL penalty improves exploration efficiency.

Abstract

The ability to achieve long-term goals is a key challenge in the current development of large language models (LLMs). To address this, pre-trained LLMs can be fine-tuned with reinforcement learning (RL) to explore solutions that optimize a given goal. However, exploration with LLMs is difficult, as a balance has to be struck between discovering new solutions and staying close enough to the pre-trained model, so as not to degrade basic capabilities. This is typically controlled with a Kullback-Leibler (KL) penalty. In this paper, we investigate the exploration dynamics of a small language model on a simple arithmetic task. We show how varying degrees of pre-training influence exploration and demonstrate the importance of "critical tokens" which have a dramatic impact on the final outcome. Consequently, we introduce a simple modification to the KL penalty that favors exploration on…
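As a rough illustration of the standard KL penalty the abstract refers to, the sketch below computes a per-token penalty between a fine-tuned policy and its pre-trained reference and subtracts it from the reward. It is not the paper's modified penalty, whose description is cut off above. The function names, the `beta` coefficient, and the choice to place the task reward on the last token are assumptions made for this example only.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_penalized_rewards(policy_logits, ref_logits, final_reward, beta=0.1):
    """Per-token rewards with a standard KL penalty (hypothetical helper).

    policy_logits / ref_logits: one list of logits per generated token,
    from the fine-tuned policy and the pre-trained reference model.
    Each token receives -beta * KL(policy || reference); the task reward
    is added on the last token (an assumption of this sketch).
    """
    rewards = []
    for pl, rl in zip(policy_logits, ref_logits):
        kl = kl_divergence(softmax(pl), softmax(rl))
        rewards.append(-beta * kl)
    if rewards:
        rewards[-1] += final_reward
    return rewards


if __name__ == "__main__":
    policy = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    reference = [[0.0, 2.0, 0.0], [0.0, 1.0, 0.0]]
    print(kl_penalized_rewards(policy, reference, final_reward=1.0))
```

A larger `beta` keeps the policy closer to the reference model at the cost of exploration, which is the trade-off the abstract describes.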
