AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models

Jinchuan Zhang; Lu Yin; Yan Zhou; Songlin Hu

arXiv:2505.23020·cs.CR·May 30, 2025

TL;DR

AgentAlign is a safety alignment framework for agentic large language models. It synthesizes alignment data by instantiating abstract behavior chains in simulated environments, reducing the risk of malicious use without sacrificing helpfulness.

Contribution

The paper presents a new method leveraging behavior chains for safety alignment in LLMs, enhancing safety while maintaining utility, and provides open-source datasets and code.

Findings

01. Safety improved from 35.8% to 79.5%

02. Enhanced safety with minimal impact on helpfulness

03. Outperforms existing prompting methods

Abstract

The acquisition of agentic capabilities has transformed LLMs from "knowledge providers" to "action executors", a trend that, while expanding LLMs' capability boundaries, significantly increases their susceptibility to malicious use. Previous work has shown that current LLM-based agents execute numerous malicious tasks even without being attacked, indicating a deficiency in safety alignment for agentic use during the post-training phase. To address this gap, we propose AgentAlign, a novel framework that leverages abstract behavior chains as a medium for safety alignment data synthesis. By instantiating these behavior chains in simulated environments with diverse tool instances, our framework enables the generation of highly authentic and executable instructions while capturing complex multi-step dynamics. The framework further ensures model utility by proportionally synthesizing benign…
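The idea of instantiating an abstract behavior chain with concrete tool instances can be illustrated with a minimal sketch. This is not the authors' implementation: the capability names, tool names, `BehaviorChain` class, and `instantiate` function are all hypothetical, and the sketch assumes a chain is simply an ordered sequence of abstract capabilities, each of which may be realized by several simulated tools.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical mapping from abstract capabilities to simulated tool instances.
# None of these names come from the paper; they are illustrative only.
TOOL_INSTANCES = {
    "web_search": ["search_engine_a", "search_engine_b"],
    "send_message": ["email_client", "chat_app"],
}


@dataclass(frozen=True)
class BehaviorChain:
    """An abstract multi-step behavior: an ordered tuple of capability names."""
    steps: tuple


def instantiate(chain, tool_instances):
    """Return every concrete tool sequence that realizes the abstract chain."""
    # One list of candidate tools per abstract step, in chain order.
    options = [tool_instances[step] for step in chain.steps]
    # The Cartesian product enumerates all combinations of concrete tools.
    return [list(combo) for combo in product(*options)]


chain = BehaviorChain(steps=("web_search", "send_message"))
concrete = instantiate(chain, TOOL_INSTANCES)
for sequence in concrete:
    print(" -> ".join(sequence))
```

Under these assumptions, one abstract two-step chain expands into four concrete tool sequences, which suggests how a small set of abstract chains could yield diverse, executable multi-step instructions.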

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 5

Strengths

1. Relevant problem: The safety gap between conversational and agentic LLMs is a real concern worth investigating.
2. Systematic data generation: The abstract behavior chain framework provides a structured approach to generating multi-step harmful scenarios, which is important in AI safety research.

Weaknesses

1. The method is missing critical comparisons. This is fundamentally a data generation method, yet there are no comparisons to:
   - Existing safety datasets: GuardSet-X [1], ToolAlign (only briefly mentioned in the related work), and other multi-step safety datasets. How does training on your dataset compare to training on these?
   - Guardrail systems: Why not compare against ShieldAgent [2], LlamaGuard, or other input filtering approaches? These operate at inference time without requiring model re

Reviewer 02 · Rating 6 · Confidence 3

Strengths

* Clear motivation and presentation: The paper provides a well-motivated discussion of the emerging safety challenges in agentic LLMs, supported by concrete examples and quantitative evidence. The writing is clear and easy to follow.
* Originality: The idea of modeling safety through Abstract Behavior Chains is novel and insightful, as it captures multi-step harmful behaviors at the behavioral logic level rather than relying on surface text filters.
* The proposed simulation environment and acco

Weaknesses

* While the paper is strong overall, it would benefit from a more comprehensive discussion of related work on plug-and-use safety guardrails for agents, such as GuardAgent, Conseca, and Agrail, to better position AgentAlign within this growing research space.
* The training setup is not clearly described in the main text; readers may find it difficult to understand how the proposed dataset and objectives are applied during fine-tuning. Including a concise summary of the training process (current

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- Solid synthetic data generation pipeline with human validations.
- Great diagrams and plots that help explain things clearly.
- Showed great performance when compared to prompting baselines.

Weaknesses

- Only compared to weak prompting baselines. For example, I'd appreciate it if you could add at least one common guardrail baseline, such as Llama Guard 4. This is a common defense method, and including it would show whether it complements AgentAlign or is already effective enough on the benchmarks evaluated.
- No adversarial pressure studied. In reality, attackers will not give up after one try, and will likely apply the existing, common jailbreaking approach to the original ha

Code & Models

Repositories

jc-ryan/agentalign · Official

Datasets

jc-ryan/AgentAlign
dataset · 87 downloads

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Topic Modeling