PRM-Free Security Alignment of Large Models via Red Teaming and Adversarial Training
Pengfei Du

TL;DR
This paper introduces a security alignment framework for large language models that does not rely on Process Reward Models (PRMs). It uses automated red teaming and adversarial training to improve robustness and security guarantees at lower computational cost.
Contribution
The paper proposes a novel PRM-free approach combining automated red teaming and adversarial training, improving security alignment without high computational overhead.
Findings
Achieves superior security performance compared to PRM-based methods.
Reduces computational costs by 61%.
Enhances model robustness through targeted adversarial training.
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse applications, yet they pose significant security risks that threaten their safe deployment in critical domains. Current security alignment methodologies predominantly rely on Process Reward Models (PRMs) to evaluate intermediate reasoning steps, introducing substantial computational overhead and scalability constraints. This paper presents a novel PRM-free security alignment framework that leverages automated red teaming and adversarial training to achieve robust security guarantees while maintaining computational efficiency. Our approach systematically identifies vulnerabilities through sophisticated attack strategies including genetic algorithm optimization, multi-agent simulation, and advanced prompt mutation techniques. The framework enhances model robustness via targeted adversarial training with…
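The genetic-algorithm search over prompts mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the operators `mutate` and `crossover`, the `evolve` loop, and the `toy_fitness` scorer are all hypothetical names chosen for illustration, and the scorer is a harmless stand-in for whatever judge would rate a target model's response in a real red-teaming setup.

```python
import random


def mutate(prompt, rng):
    """Apply one random word-level edit (swap, drop, or duplicate)."""
    words = prompt.split()
    i = rng.randrange(len(words))
    op = rng.choice(["swap", "drop", "dup"])
    if op == "swap" and len(words) > 1:
        j = rng.randrange(len(words))
        words[i], words[j] = words[j], words[i]
    elif op == "drop" and len(words) > 1:
        del words[i]
    else:
        # Duplicate the chosen word; also the fallback for one-word prompts.
        words.insert(i, words[i])
    return " ".join(words)


def crossover(a, b, rng):
    """Join a non-empty prefix of `a` with a (possibly empty) suffix of `b`."""
    wa, wb = a.split(), b.split()
    cut_a = rng.randrange(1, len(wa) + 1)
    cut_b = rng.randrange(0, len(wb) + 1)
    return " ".join(wa[:cut_a] + wb[cut_b:])


def evolve(seeds, fitness, generations=10, pop_size=20, elite=4, seed=0):
    """Elitist genetic search: keep the top candidates, refill with offspring."""
    rng = random.Random(seed)
    population = list(seeds)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:elite]  # elites survive unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.choice(parents), rng.choice(parents)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    return max(population, key=fitness)


def toy_fitness(prompt):
    """Assumed placeholder score: number of distinct words in the prompt."""
    return len(set(prompt.split()))


if __name__ == "__main__":
    seeds = ["alpha beta gamma", "delta epsilon"]
    print(evolve(seeds, toy_fitness))
```

Because the elites are carried over unchanged in every generation, the best score found never falls below that of the best seed. In an actual pipeline the fitness function would presumably query the model under test, and the highest-scoring prompts would feed the adversarial training set.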
