Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning
Jean Vassoyan, Nathanaël Beau, Roman Plaud

TL;DR
This paper modifies the KL penalty used in RL fine-tuning of language models so that exploration is favored on critical tokens, with the aim of improving how well the model achieves long-term goals.
Contribution
A simple modification to the KL penalty that favors exploration on critical tokens during RL fine-tuning.
Findings
The degree of pre-training influences how a small language model explores on a simple arithmetic task.
"Critical tokens" have a dramatic impact on the final outcome.
The modified KL penalty improves exploration efficiency.
Abstract
The ability to achieve long-term goals is a key challenge in the current development of large language models (LLMs). To address this, pre-trained LLMs can be fine-tuned with reinforcement learning (RL) to explore solutions that optimize a given goal. However, exploration with LLMs is difficult, as a balance has to be struck between discovering new solutions and staying close enough to the pre-trained model, so as not to degrade basic capabilities. This is typically controlled with a Kullback-Leibler (KL) penalty. In this paper, we investigate the exploration dynamics of a small language model on a simple arithmetic task. We show how varying degrees of pre-training influence exploration and demonstrate the importance of "critical tokens" which have a dramatic impact on the final outcome. Consequently, we introduce a simple modification to the KL penalty that favors exploration on…
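The KL penalty mentioned in the abstract is commonly applied as a per-token term subtracted from the reward. The sketch below illustrates that common formulation only; it is not taken from the paper, whose specific modification is cut off in the abstract above. The function name, the `beta` coefficient, and the optional `weights` argument are all assumptions made for illustration. `weights` is a hypothetical per-token multiplier showing where a token-dependent relaxation of the penalty could plug in.

```python
def kl_penalized_rewards(logp_policy, logp_ref, task_reward, beta=0.1, weights=None):
    """Per-token rewards with a KL penalty against a frozen reference model.

    logp_policy, logp_ref: log-probabilities of the sampled tokens under the
        fine-tuned policy and the pre-trained (reference) model.
    task_reward: scalar reward granted at the end of the sequence.
    beta: penalty coefficient (assumed value, not from the paper).
    weights: hypothetical per-token multipliers on the penalty; 1.0 keeps the
        usual penalty, 0.0 ignores it for that token.
    """
    n = len(logp_policy)
    if n == 0 or len(logp_ref) != n:
        raise ValueError("log-probability lists must be non-empty and equal in length")
    if weights is None:
        weights = [1.0] * n
    if len(weights) != n:
        raise ValueError("weights must have one entry per token")

    # Single-sample KL estimate per token: log pi(a|s) - log pi_ref(a|s).
    rewards = [
        -beta * w * (lp - lr)
        for lp, lr, w in zip(logp_policy, logp_ref, weights)
    ]
    # The task reward is only observed once the sequence is complete.
    rewards[-1] += task_reward
    return rewards


# The policy strongly prefers the second token compared with the reference.
logp_policy = [-0.1, -0.2, -0.3]
logp_ref = [-0.2, -2.0, -0.3]

uniform = kl_penalized_rewards(logp_policy, logp_ref, task_reward=1.0)
relaxed = kl_penalized_rewards(
    logp_policy, logp_ref, task_reward=1.0, weights=[1.0, 0.0, 1.0]
)
```

With uniform weights, the second token carries the largest penalty because the policy departs most from the reference there. Setting its weight to zero removes that penalty and leaves the other tokens unchanged, which is the kind of token-level lever a modified penalty could use.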
