ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System
Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna, Rahul Gupta, Kai-Wei Chang, Aram Galstyan, Charith Peris

TL;DR
The paper introduces ARES, a framework for identifying and repairing dual vulnerabilities in RLHF systems by dynamically generating adversarial prompts and iteratively improving both the reward model and the core language model.
Contribution
ARES discovers systemic weaknesses in RLHF pipelines by targeting both the LLM and the reward model, and applies a two-stage repair process to improve safety robustness.
Findings
ARES improves safety robustness across multiple benchmarks.
The framework maintains core model capabilities after repair.
Dual-targeting exposes and mitigates vulnerabilities effectively.
Abstract
Reinforcement Learning from Human Feedback (RLHF) is central to aligning Large Language Models (LLMs), yet it introduces a critical vulnerability: an imperfect Reward Model (RM) can become a single point of failure when it fails to penalize unsafe behaviors. While existing red-teaming approaches primarily target policy-level weaknesses, they overlook what we term systemic weaknesses: cases where both the core LLM and the RM fail in tandem. We present ARES, a framework that systematically discovers and mitigates such dual vulnerabilities. ARES employs a "Safety Mentor" that dynamically composes semantically coherent adversarial prompts by combining structured component types (topics, personas, tactics, goals) and generates corresponding malicious and safe responses. This dual-targeting approach exposes weaknesses in both the core LLM and the RM simultaneously. Using the…
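The abstract's two core ideas, composing prompts from structured component types and separating policy-level failures from systemic ones, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `PromptSpec`, `compose_specs`, and `classify_failure` are hypothetical, the exhaustive Cartesian product stands in for whatever adaptive selection the Safety Mentor actually performs, and the rule that the RM "fails" when it scores the malicious response at least as high as the safe one is an assumption about how failing to penalize unsafe behavior might be operationalized.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterable, Iterator


@dataclass(frozen=True)
class PromptSpec:
    """One combination of the four component types named in the abstract."""
    topic: str
    persona: str
    tactic: str
    goal: str


def compose_specs(
    topics: Iterable[str],
    personas: Iterable[str],
    tactics: Iterable[str],
    goals: Iterable[str],
) -> Iterator[PromptSpec]:
    # Assumption: enumerate every combination. The real Safety Mentor is
    # described as dynamic, so it presumably selects combinations adaptively.
    for topic, persona, tactic, goal in product(topics, personas, tactics, goals):
        yield PromptSpec(topic, persona, tactic, goal)


def classify_failure(
    policy_unsafe: bool, rm_score_malicious: float, rm_score_safe: float
) -> str:
    # Assumption: the RM fails when it does not rank the safe response
    # strictly above the malicious one.
    rm_fails = rm_score_malicious >= rm_score_safe
    if policy_unsafe and rm_fails:
        return "systemic"      # both the core LLM and the RM fail in tandem
    if policy_unsafe:
        return "policy_only"   # the LLM fails but the RM would penalize it
    if rm_fails:
        return "rm_only"       # the LLM is safe but the RM is miscalibrated
    return "none"


if __name__ == "__main__":
    # Placeholder component values, for illustration only.
    specs = list(compose_specs(["topic_a", "topic_b"], ["persona_a"],
                               ["tactic_a"], ["goal_a", "goal_b"]))
    print(len(specs), "candidate specs")
    print(classify_failure(True, rm_score_malicious=0.7, rm_score_safe=0.4))
```

Under these assumptions, only the "systemic" bucket corresponds to the dual vulnerabilities the abstract highlights; the other labels are included to show what a policy-only red-teaming approach would and would not surface.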
