SafeDialBench: A Fine-Grained Safety Evaluation Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks
Hongye Cao, Sijia Jing, Yanming Wang, Ziyue Peng, Zhixin Bai, Zhe Cao, Meng Fang, Fan Feng, Boyan Wang, Jiaheng Liu, Tianpei Yang, Jing Huo, Yang Gao, Fanyu Meng, Xi Yang, Chao Deng, Junlan Feng

TL;DR
SafeDialBench is a comprehensive benchmark for evaluating large language models' safety in multi-turn dialogues against diverse jailbreak attacks, addressing limitations of previous single-turn, single-attack benchmarks.
Contribution
The paper introduces a fine-grained, multi-dimensional safety benchmark with over 4000 dialogues, a hierarchical safety taxonomy, and an assessment framework for detecting and handling unsafe content.
Findings
Yi-34B-Chat and GLM4-9B-Chat show superior safety performance
Llama3.1-8B-Instruct and o3-mini exhibit safety vulnerabilities
Benchmark covers 6 safety dimensions and 22 dialogue scenarios
Abstract
With the rapid advancement of Large Language Models (LLMs), their safety has become a critical concern requiring precise assessment. Current benchmarks primarily concentrate on single-turn dialogues or a single jailbreak attack method to assess safety. Additionally, these benchmarks do not examine in detail an LLM's capability to identify and handle unsafe information. To address these issues, we propose SafeDialBench, a fine-grained benchmark for evaluating the safety of LLMs across various jailbreak attacks in multi-turn dialogues. Specifically, we design a two-tier hierarchical safety taxonomy that considers 6 safety dimensions, and we generate more than 4000 multi-turn dialogues in both Chinese and English under 22 dialogue scenarios. We employ 7 jailbreak attack strategies, such as reference attack and purpose reverse, to enhance the dataset quality for…
Peer Reviews
Decision: ICLR 2026 Poster
1. In general, the paper is well written, well structured, and easy to follow.
2. The authors identify critical gaps in current benchmarks, namely a single-turn focus and limited jailbreak coverage, and propose a valid benchmark to address them.
3. The proposed benchmark makes use of seven different jailbreak strategies (scene construct, purpose reverse, role play, topic change, reference attack, fallacy attack, probing questions), addresses a range of realistic red-teaming strategies and imp
1. The 7 attack methods, while diverse, are known strategies. The field of jailbreaking is adversarial and evolves rapidly. There is a risk that models will quickly be "patched" against these specific 7 attacks, or even overfit to this benchmark.
2. The paper treats a fixed response score as the cutoff for "successfully attacked" (ASR). The rationale for choosing this score as the cutoff should be defended empirically with a sensitivity analysis or motivated by human agreement patterns.
3. The minimum-score
The benchmark is comprehensive in terms of categories and scenarios. The experiments also cover many LLMs, which is impressive. The presentation is clear, and the paper is well written.
- My first concern is the human involvement in the dialogue question design. Since human design is not efficient, many multi-turn jailbreak attack methods [1,2] have been developed to automate the generation of questions. So why not use existing automatic multi-turn jailbreak queries as the initial questions?
- Besides, many of the benchmarked attack methods are newly developed, and I wonder how effective they are compared to existing multi-turn jailbreak attack methods [1,2].
- The
1. It covers multi-turn, bilingual, multi-attack settings, which is more realistic than prior single-turn tests.
2. The experiment is comprehensive, and the paper is easy to follow.
1. The paper relies mostly on works from 2022–2024, with few citations from 2025 despite the rapid emergence of new jailbreak, safety-alignment, and multi-turn evaluation studies.
2. While the benchmark reveals valuable diagnostic insights, the paper offers no concrete direction for mitigating multi-turn safety failures. It would be helpful to include conceptual or empirical suggestions.
3. The proposed "two-tier hierarchical safety taxonomy" is descriptive but not empirically justified. Seve
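One of the reviews above asks that the score cutoff behind ASR be defended with a sensitivity analysis. A minimal sketch of such a sweep is given below. Everything in it is an assumption made for illustration and is not taken from the paper: the per-turn score lists, the use of the lowest per-turn score as the dialogue-level score, the "at or below the cutoff" rule, and the function names `attack_success_rate` and `cutoff_sensitivity`.

```python
from typing import Dict, List, Sequence


def attack_success_rate(dialogue_scores: Sequence[Sequence[float]], cutoff: float) -> float:
    """Fraction of dialogues counted as successfully attacked.

    Assumption (hypothetical, not the paper's definition): a dialogue counts as
    attacked when its lowest per-turn safety score is at or below `cutoff`.
    """
    if not dialogue_scores:
        raise ValueError("need at least one dialogue")
    attacked = sum(1 for turns in dialogue_scores if min(turns) <= cutoff)
    return attacked / len(dialogue_scores)


def cutoff_sensitivity(
    dialogue_scores: Sequence[Sequence[float]], cutoffs: Sequence[float]
) -> Dict[float, float]:
    """Recompute ASR at each candidate cutoff to see how much it moves."""
    return {c: attack_success_rate(dialogue_scores, c) for c in cutoffs}


# Made-up per-turn scores for four dialogues (illustrative data only).
scores: List[List[float]] = [
    [9, 8, 7],
    [9, 6, 3],
    [10, 10, 9],
    [8, 5, 6],
]

sweep = cutoff_sensitivity(scores, [3, 5, 7])
for cutoff, asr in sweep.items():
    print(f"cutoff={cutoff}: ASR={asr:.2f}")
```

If the reported ASR changes sharply between neighbouring cutoffs, as it does in this toy data, model rankings may depend on the cutoff choice, which is the concern the reviewer raises.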
Taxonomy
Topics: Interpreting and Communication in Healthcare · Deception detection and forensic psychology · Topic Modeling
