SafetyFlow: An Agent-Flow System for Automated LLM Safety Benchmarking
Xiangyang Zhu, Yuan Tian, Chunyi Li, Kaiwei Zhang, Wei Sun, Guangtao Zhai

TL;DR
SafetyFlow is an agent-flow system that automates the construction of comprehensive LLM safety benchmarks, significantly reducing manual effort and resource consumption while maintaining high quality and discriminative power.
Contribution
This paper introduces the first fully automated pipeline for LLM safety benchmarking, producing a large, high-quality safety dataset without human intervention.
Findings
SafetyFlow can build a safety benchmark in four days.
The final dataset contains 23,446 queries with low redundancy.
Extensive experiments validate the system's efficacy and efficiency.
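The redundancy finding above can be made concrete with a small sketch. The Python below estimates a redundancy rate as the fraction of queries that near-duplicate an earlier one, using Jaccard similarity over word tokens. This is an illustrative assumption, not SafetyFlow's actual deduplication method; the function names, the 0.8 threshold, and the sample queries are all hypothetical.

```python
import re


def tokens(text):
    """Lowercased word tokens as a set (a crude stand-in for semantic similarity)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def redundancy_rate(queries, threshold=0.8):
    """Fraction of queries that near-duplicate an earlier, already kept query.

    The threshold is an assumed value chosen for illustration only.
    """
    kept = []
    duplicates = 0
    for query in queries:
        current = tokens(query)
        if any(jaccard(current, earlier) >= threshold for earlier in kept):
            duplicates += 1
        else:
            kept.append(current)
    return duplicates / len(queries) if queries else 0.0


# Hypothetical sample queries, not taken from the SafetyFlow dataset.
sample = [
    "How do I pick a lock?",
    "how do i pick a lock",
    "What household chemicals are dangerous to mix?",
    "Explain how to pick a lock quickly",
]
rate = redundancy_rate(sample)
print(f"redundancy rate: {rate:.2f}")
```

Token overlap only catches surface-level duplicates; a paraphrase with different wording, such as the last sample query, is not flagged, which is why embedding-based similarity is a common alternative.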
Abstract
The rapid proliferation of large language models (LLMs) has intensified the need for reliable safety evaluation to uncover model vulnerabilities. To this end, numerous LLM safety evaluation benchmarks have been proposed. However, existing benchmarks generally rely on labor-intensive manual curation, which consumes excessive time and resources. They also exhibit significant redundancy and limited difficulty. To alleviate these problems, we introduce SafetyFlow, the first agent-flow system designed to automate the construction of LLM safety benchmarks. SafetyFlow can automatically build a comprehensive safety benchmark in only four days without any human intervention by orchestrating seven specialized agents, significantly reducing time and resource costs. Equipped with versatile tools, the agents of SafetyFlow ensure process and cost controllability while integrating human…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper discusses the importance of automating benchmark construction, an important topic given today's rapidly evolving LLM capabilities.
2. The paper presents a reasonable set of experiments to demonstrate the importance of different design choices, and that the benchmark indeed helps differentiate models in terms of their safety.
3. (Although not explained clearly enough), it is interesting to see that the presented pipeline can be reused for generating benchmarks beyond
1. The paper is limited in novelty and technical depth:
1.1 There are already several existing works on automatic benchmark construction for LLMs that cover different evaluation angles, including safety. While the paper has a related work section (only in the supplementary materials), it does not make a clear distinction between the presented approach and the existing ones. For instance, it is not clear what the advantage of the presented pipeline over Auto-Bencher (https://arxiv.org/ab
1. It automates the labor-intensive dataset generation process by using separate agents. The reduction in construction time is impactful compared to other generation methods.
2. It reduces data redundancy, which can hinder the fair evaluation of LLMs, by 4.5%. Also, it takes just 4 days to construct the entire benchmark without human labor.
3. Their specialized agents for each stage work harmoniously, and their method covers diverse safety samples, including user characteristics (role, tone, etc.) a
1. The claim of full automation seems to be overstated. The framework still requires significant manual intervention, particularly for the initial data pool construction and the creation of hand-crafted tools for the agents. Furthermore, the categorization agent's strategy of merely aggregating existing benchmarks raises doubts about the novelty and necessity of the LLM's role in this specific step.
2. The claim that the LLM agent reduces redundancy is unsubstantiated. This improvement appears
1. The paper presents a clear and well-motivated problem formulation. The authors identify three concrete limitations of existing safety benchmarks with compelling evidence: resource-intensive construction requiring excessive manual labor, severe redundancy with S-Eval showing over 50% duplication and other benchmarks showing 30%+ duplication as demonstrated in Figure 1. This problem definition is particularly timely given the rapid proliferation of safety benchmarks in recent years, and the pro
1. One major concern is the absence of quality validation for the automatically generated samples. The paper presents no evidence that the 23,446 automatically generated samples are actually meaningful, semantically coherent, or representative of real safety risks. There is no human expert evaluation of the generated data whatsoever. Domain experts have not assessed whether samples correctly belong to their assigned categories or whether the synthetic data captures genuine safety concerns. The a
+ The paper correctly identifies a major bottleneck in LLM safety research: the labor-intensive, slow, and costly nature of manual benchmark creation.
+ The core idea, using a multi-agent AI system to automate benchmark creation, is innovative.
- The benchmark is generated entirely by AI agents, which means the data could be too synthetic.
- There are quite a few technical details in the benchmark construction that are not explained well.
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection
