RedRFT: A Light-Weight Benchmark for Reinforcement Fine-Tuning-Based Red Teaming
Xiang Zheng, Xingjun Ma, Wei-Bin Lee, Cong Wang

TL;DR
RedRFT introduces a lightweight, standardized benchmark for reinforcement fine-tuning-based red teaming of large language models, enabling consistent evaluation, reproducibility, and rapid prototyping of RFT methods.
Contribution
It provides a modular, easy-to-use benchmark that addresses implementation variability and supports diverse experimental configurations for RFT-based red teaming.
Findings
Conducted extensive ablation studies on key RFT components.
Demonstrated the impact of implementation choices on RFT performance.
Provided a comprehensive, reproducible framework for future RFT research.
Abstract
Red teaming has proven to be an effective method for identifying and mitigating vulnerabilities in Large Language Models (LLMs). Reinforcement Fine-Tuning (RFT) has emerged as a promising strategy among existing red teaming techniques. However, the lack of a unified benchmark hinders current RFT-based red teaming methods. Implementation details, especially in Proximal Policy Optimization (PPO)-based RFT, significantly affect outcome stability and reproducibility. To address this issue, we introduce RedRFT, a lightweight benchmark designed to simplify and standardize the implementation and evaluation of RFT-based red teaming. RedRFT combines the design strengths of both single-file CleanRL and highly modularized Tianshou, offering high-quality single-file red teaming implementations and modular PPO core components, such as the Generalized Advantage Estimator (GAE). It supports a variety of token and…
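The advantage estimator named in the abstract is commonly known as Generalized Advantage Estimation (GAE), a standard PPO building block. The sketch below shows the usual backward recursion in plain Python. It is not taken from the RedRFT code base: the function name `compute_gae`, its arguments, and the default values of `gamma` and `lam` are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def compute_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    last_value: float,
    dones: Sequence[bool],
    gamma: float = 0.99,  # discount factor (illustrative default)
    lam: float = 0.95,    # GAE smoothing parameter (illustrative default)
) -> Tuple[List[float], List[float]]:
    """Hypothetical helper: GAE over one rollout.

    rewards[t] and values[t] are the reward and value estimate at step t,
    last_value is the value estimate for the state after the final step,
    and dones[t] is True when the episode ended at step t.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    # Walk the rollout backwards so each step can reuse the running sum.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; no bootstrapping across an episode boundary.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    # Value-function targets are the advantages plus the value baseline.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


if __name__ == "__main__":
    adv, ret = compute_gae(
        rewards=[1.0, 1.0],
        values=[0.0, 0.0],
        last_value=0.0,
        dones=[False, True],
        gamma=1.0,
        lam=1.0,
    )
    print(adv, ret)
```

With `lam=0` the estimator reduces to the one-step TD error, and with `lam=1` it becomes the discounted return minus the value baseline; values in between trade bias against variance.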
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Software Engineering Research
Methods: Entropy Regularization · Proximal Policy Optimization
