ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System

Jiacheng Liang; Yao Ma; Tharindu Kumarage; Satyapriya Krishna; Rahul Gupta; Kai-Wei Chang; Aram Galstyan; Charith Peris

arXiv:2604.18789·cs.AI·April 22, 2026

TL;DR

The paper introduces ARES, a framework for identifying and repairing dual vulnerabilities in RLHF systems by dynamically generating adversarial prompts and iteratively improving both the reward model and the core language model.

Contribution

ARES discovers systemic weaknesses in RLHF pipelines, cases where the LLM and the reward model fail together, by targeting both models at once, and applies a two-stage repair process to improve safety robustness.

Findings

01. ARES improves safety robustness across multiple benchmarks.

02. The framework maintains core model capabilities after repair.

03. Dual-targeting exposes and mitigates vulnerabilities effectively.

Abstract

Reinforcement Learning from Human Feedback (RLHF) is central to aligning Large Language Models (LLMs), yet it introduces a critical vulnerability: an imperfect Reward Model (RM) can become a single point of failure when it fails to penalize unsafe behaviors. While existing red-teaming approaches primarily target policy-level weaknesses, they overlook what we term systemic weaknesses: cases where both the core LLM and the RM fail in tandem. We present ARES, a framework that systematically discovers and mitigates such dual vulnerabilities. ARES employs a "Safety Mentor" that dynamically composes semantically coherent adversarial prompts by combining structured component types (topics, personas, tactics, goals) and generates corresponding malicious and safe responses. This dual-targeting approach exposes weaknesses in both the core LLM and the RM simultaneously. Using the…
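To make the abstract's two central ideas concrete, here is a minimal Python sketch of (a) composing prompt specifications from the four component types and (b) labeling a case as a systemic weakness when the core LLM and the RM fail together. It is an illustration only, not the authors' implementation. The component values, the `PromptSpec` type, and the function names are all hypothetical, and the RM failure test (the RM scores the malicious response at least as high as the safe one) is an assumption about how "fails to penalize unsafe behaviors" could be operationalized.

```python
import itertools
from dataclasses import dataclass

# Hypothetical placeholder pools: the summary names the four component
# types but does not list their actual values.
COMPONENTS = {
    "topic": ["topic_A", "topic_B"],
    "persona": ["persona_A", "persona_B"],
    "tactic": ["tactic_A", "tactic_B"],
    "goal": ["goal_A"],
}


@dataclass(frozen=True)
class PromptSpec:
    """One combination of the four structured component types."""
    topic: str
    persona: str
    tactic: str
    goal: str


def enumerate_specs(components):
    """Return every topic/persona/tactic/goal combination as a PromptSpec."""
    keys = ("topic", "persona", "tactic", "goal")
    pools = [components[k] for k in keys]
    return [PromptSpec(*combo) for combo in itertools.product(*pools)]


def rm_prefers_unsafe(score_malicious, score_safe):
    """Assumed RM failure test: the malicious response is not scored lower."""
    return score_malicious >= score_safe


def classify_failure(policy_unsafe, rm_failed):
    """Label a test case by which parts of the policy-reward system failed."""
    if policy_unsafe and rm_failed:
        return "systemic"      # both the core LLM and the RM fail in tandem
    if policy_unsafe:
        return "policy_only"   # the RM would still penalize the behavior
    if rm_failed:
        return "rm_only"       # the LLM stayed safe, but the RM is exploitable
    return "none"
```

In this sketch, exhaustive enumeration stands in for the Safety Mentor's dynamic, semantically coherent composition, which the abstract does not specify in detail. The `systemic` label corresponds to the dual failures the paper targets; the other labels are included only to show the contrast with policy-level red-teaming.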
