# Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning

**Authors:** Weitao Feng, Lixu Wang, Peizhuo Lv, Tianyi Wei, Jie Zhang, Chongyang Gao, Sinong Zhan, Wei Dong

arXiv: 2508.20697 · 2026-05-12

## TL;DR

This paper introduces TokenBuncher, a defense against reinforcement learning (RL)-based harmful fine-tuning of large language models. It suppresses model response entropy, which mitigates harmful RL fine-tuning while preserving benign task performance and finetunability.

## Contribution

TokenBuncher is presented as the first effective defense specifically targeting RL-based harmful fine-tuning. It constrains model response entropy so that RL can no longer exploit distinct reward signals to drive the model toward harmful behaviors.
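
To make the link between low entropy and weak reward signals concrete, here is a minimal sketch, not taken from the paper. It assumes a GRPO-style group-normalized advantage (reward minus group mean, divided by group standard deviation); the function name `group_advantages` and all reward values are hypothetical. When a low-entropy model returns near-identical responses to the same prompt, their rewards coincide and the advantages collapse to zero, so a policy-gradient update has nothing to push on.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages, as used in GRPO-style RL (assumed setup)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the sampled group
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four sampled responses to one prompt.
# High-entropy model: responses differ, so rewards differ.
diverse = group_advantages([0.1, 0.9, 0.4, 0.6])

# Low-entropy model: responses are effectively identical, so rewards match.
collapsed = group_advantages([0.5, 0.5, 0.5, 0.5])

print("diverse:  ", [round(a, 3) for a in diverse])
print("collapsed:", collapsed)
```

The paper evaluates multiple RL algorithms; this sketch covers only the group-normalized case and says nothing about how the others behave.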

## Key findings

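- Across multiple models and RL algorithms, TokenBuncher robustly mitigates harmful RL fine-tuning.
- The defense preserves benign task performance and finetunability.
- Under matched computational budgets, RL breaks safety alignment more effectively than supervised fine-tuning (SFT) and enables more advanced harmful task assistance, making it the greater systemic risk.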

## Abstract

As large language models (LLMs) continue to grow in capability, so do the risks of harmful misuse through fine-tuning. While most prior studies assume that attackers rely on supervised fine-tuning (SFT) for such misuse, we systematically demonstrate that reinforcement learning (RL) enables adversaries to more effectively break safety alignment and facilitate more advanced harmful task assistance, under matched computational budgets. To counter this emerging threat, we propose TokenBuncher, the first effective defense specifically targeting RL-based harmful fine-tuning. TokenBuncher suppresses the foundation on which RL relies: model response entropy. By constraining entropy, RL-based fine-tuning can no longer exploit distinct reward signals to drive the model toward harmful behaviors. We realize this defense through entropy-as-reward RL and a Token Noiser mechanism designed to prevent the escalation of harmful capabilities. Extensive experiments across multiple models and RL algorithms show that TokenBuncher robustly mitigates harmful RL fine-tuning while preserving benign task performance and finetunability. Our results highlight that RL-based harmful fine-tuning poses a greater systemic risk than SFT, and that TokenBuncher provides an effective and general defense.
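
The abstract does not spell out how response entropy is measured or turned into a reward. The sketch below is one plausible reading, not the authors' implementation. It assumes entropy is the Shannon entropy of each next-token distribution averaged over a response, and that the entropy-as-reward signal is simply its negative. The names `token_entropy`, `response_entropy`, and `entropy_reward` are hypothetical, and the logits are toy values over a four-token vocabulary.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one next-token logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of a single next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def response_entropy(per_token_logits):
    """Mean token entropy over all positions of one response."""
    return sum(token_entropy(l) for l in per_token_logits) / len(per_token_logits)

def entropy_reward(per_token_logits):
    """Assumed reward: larger when the model is more certain at each step."""
    return -response_entropy(per_token_logits)

# Toy responses of three tokens each.
peaked = [[10.0, 0.0, 0.0, 0.0]] * 3  # one token dominates at every step
flat = [[0.0, 0.0, 0.0, 0.0]] * 3     # uniform over the vocabulary

print("peaked entropy:", response_entropy(peaked))
print("flat entropy:  ", response_entropy(flat))
```

Under this reading, a uniform distribution yields the maximum entropy of `log(vocab_size)` and therefore the lowest reward, while a sharply peaked distribution scores near zero entropy. The Token Noiser component is not modeled here, since the abstract gives no detail on how it works.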

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20697/full.md

## Figures

14 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20697/full.md

## References

72 references — full list in the complete paper: https://tomesphere.com/paper/2508.20697/full.md

---
Source: https://tomesphere.com/paper/2508.20697