RedRFT: A Light-Weight Benchmark for Reinforcement Fine-Tuning-Based Red Teaming

Xiang Zheng; Xingjun Ma; Wei-Bin Lee; Cong Wang

arXiv:2506.04302 · cs.LG · June 6, 2025

TL;DR

RedRFT introduces a lightweight, standardized benchmark for reinforcement fine-tuning-based red teaming of large language models, enabling consistent evaluation, reproducibility, and rapid prototyping of RFT methods.

Contribution

It provides a modular, easy-to-use benchmark that addresses implementation variability and supports diverse experimental configurations for RFT-based red teaming.

Findings

01 Conducted extensive ablation studies on key RFT components.

02 Demonstrated the impact of implementation choices on RFT performance.

03 Provided a comprehensive, reproducible framework for future RFT research.

Abstract

Red teaming has proven to be an effective method for identifying and mitigating vulnerabilities in Large Language Models (LLMs). Reinforcement Fine-Tuning (RFT) has emerged as a promising strategy among existing red teaming techniques. However, the lack of a unified benchmark hinders current RFT-based red teaming methods. Implementation details, especially in Proximal Policy Optimization (PPO)-based RFT, significantly affect outcome stability and reproducibility. To address this issue, we introduce RedRFT, a lightweight benchmark designed to simplify and standardize the implementation and evaluation of RFT-based red teaming. RedRFT combines the design strengths of both single-file CleanRL and highly modularized Tianshou, offering high-quality single-file red teaming implementations and modular PPO core components, such as the Generalized Advantage Estimator (GAE). It supports a variety of token and…
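The abstract names the advantage estimator as one of the modular PPO core components. As a rough illustration of what such a component computes, here is a minimal sketch of the standard Generalized Advantage Estimation recursion in plain Python. It is not taken from the RedRFT code base: the function name `compute_gae`, its parameters, and the sparse end-of-sequence reward in the usage lines are all assumptions made for illustration.

```python
from typing import List, Sequence


def compute_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    last_value: float,
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Standard GAE over one trajectory (hypothetical helper, not RedRFT's API).

    rewards[t] and values[t] are the per-step reward and value estimate;
    last_value is the bootstrap value after the final step (0.0 if terminal).
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")

    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    # Walk backwards so each step can reuse the accumulated future advantage.
    for t in reversed(range(len(rewards))):
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of TD errors with decay gamma * lam
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages


# Assumed setup: a three-token generation scored only at the final token.
token_rewards = [0.0, 0.0, 1.0]
token_values = [0.5, 0.5, 0.5]
advantages = compute_gae(token_rewards, token_values, last_value=0.0)
# Value-function regression targets are commonly advantage + value.
returns = [a + v for a, v in zip(advantages, token_values)]
```

Setting `lam=1.0` recovers the discounted Monte Carlo return minus the value baseline, while `lam=0.0` reduces to the one-step TD error. Small choices like this are the kind of implementation detail the abstract says affects stability and reproducibility.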

Code & Models

Repositories

x-zheng16/redrft (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Software Engineering Research

Methods: Entropy Regularization · Proximal Policy Optimization