ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors
Zhexin Zhang, Yida Lu, Jingyuan Ma, Di Zhang, Rui Li, Pei Ke, Hao Sun, Lei Sha, Zhifang Sui, Hongning Wang, Minlie Huang

TL;DR
ShieldLM is a novel LLM-based safety detector that aligns with safety standards, offers customization and explanations, and outperforms existing methods in safety detection tasks.
Contribution
We introduce ShieldLM, an LLM-based safety detection framework that supports customizable detection rules, explains its decisions, and is trained on a large bilingual safety-annotated dataset.
Findings
ShieldLM surpasses strong baselines on multiple test sets.
It demonstrates high customizability and explainability.
It is effective as a safety evaluator for advanced LLMs.
Abstract
The safety of Large Language Models (LLMs) has gained increasing attention in recent years, but a comprehensive approach for detecting safety issues within LLMs' responses in an aligned, customizable and explainable manner is still lacking. In this paper, we propose ShieldLM, an LLM-based safety detector, which aligns with common safety standards, supports customizable detection rules, and provides explanations for its decisions. To train ShieldLM, we compile a large bilingual dataset comprising 14,387 query-response pairs, annotating the safety of responses based on various safety standards. Through extensive experiments, we demonstrate that ShieldLM surpasses strong baselines across four test sets, showcasing remarkable customizability and explainability. Besides performing well on standard detection datasets, ShieldLM has also been shown to be effective as a safety evaluator for…
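To make the idea of customizable rules and explained decisions concrete, here is a minimal sketch of how a caller might wrap an LLM-based safety detector. It is an illustration only: the prompt template, the `[Answer]`/`[Analysis]` output tags, and the names `build_prompt`, `parse_verdict`, and `Verdict` are all hypothetical and are not taken from ShieldLM's released code or prompt format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Verdict:
    label: str        # e.g. "safe" or "unsafe"; "unknown" if parsing fails
    explanation: str  # free-text justification returned by the detector


def build_prompt(query: str, response: str,
                 rules: Optional[List[str]] = None) -> str:
    """Assemble a detection prompt (hypothetical template, not ShieldLM's own)."""
    parts = [
        "Judge whether the response below is safe.",
        f"Query: {query}",
        f"Response: {response}",
    ]
    if rules:
        # Custom rules are appended as a numbered list so the detector
        # can refer to them in its explanation.
        numbered = "\n".join(f"{i}. {r}" for i, r in enumerate(rules, 1))
        parts.append("Apply these custom rules:\n" + numbered)
    parts.append("Answer as:\n[Answer] <safe|unsafe>\n[Analysis] <explanation>")
    return "\n\n".join(parts)


def parse_verdict(text: str) -> Verdict:
    """Extract the label and explanation from the assumed output format."""
    label, explanation = "unknown", ""
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("[Answer]"):
            label = line[len("[Answer]"):].strip().lower()
        elif line.startswith("[Analysis]"):
            explanation = line[len("[Analysis]"):].strip()
    return Verdict(label, explanation)
```

In this sketch the caller would send the string from `build_prompt` to whichever detector model it uses and pass the generated text to `parse_verdict`; changing the `rules` list is the only step needed to adapt the check to a different safety policy.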
Code & Models
- 🤗 thu-coai/ShieldLM-14B-qwen (model) · 579 dl · ♡ 14
- 🤗 thu-coai/ShieldLM-13B-baichuan2 (model) · 7 dl · ♡ 3
- 🤗 thu-coai/ShieldLM-7B-internlm2 (model) · 67 dl · ♡ 11
- 🤗 thu-coai/ShieldLM-6B-chatglm3 (model) · 9.1k dl · ♡ 4
- 🤗 RichardErkhov/thu-coai_-_ShieldLM-14B-qwen-gguf (model) · 15 dl
- 🤗 RichardErkhov/thu-coai_-_ShieldLM-13B-baichuan2-gguf (model) · 9 dl
Taxonomy
Topics: Digital and Cyber Forensics · Adversarial Robustness in Machine Learning · Data Quality and Management
