Libra: Large Chinese-based Safeguard for AI Content
Ziyang Chen, Huimu Yu, Xing Wu, Dongqin Liu, and Songlin Hu

TL;DR
Libra-Guard is a safeguard system for Chinese-language LLMs, trained with a two-stage curriculum. It is introduced together with Libra-Test, a new benchmark for evaluating how effectively safeguard systems detect harmful Chinese content.
Contribution
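The paper introduces Libra-Guard, a safeguard system trained with a two-stage curriculum (guard pretraining on synthetic samples, followed by fine-tuning on high-quality real-world data), and Libra-Test, a benchmark for evaluating safeguard systems on Chinese content. Libra-Guard reaches 86.79% accuracy in the reported experiments.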
Findings
Libra-Guard achieves 86.79% accuracy in safety evaluation.
It outperforms existing models such as Qwen2.5-14B-Instruct.
It approaches the safety performance of models such as GPT-4o.
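The accuracy figure above is the fraction of benchmark samples on which the safeguard's safe/unsafe verdict matches the expert label. The following is a minimal sketch of that computation, not the authors' evaluation code: the `Sample` fields, the scenario names, and the keyword-based `toy_guard` are all hypothetical stand-ins, and a real run would call an actual safeguard model on Chinese-language samples.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class Sample(NamedTuple):
    """One benchmark item (hypothetical schema, not the Libra-Test format)."""
    text: str         # content to be judged
    scenario: str     # harm scenario tag (placeholder names below)
    is_unsafe: bool   # gold label from annotators


def accuracy_report(
    samples: Iterable[Sample],
    guard: Callable[[str], bool],
) -> Dict[str, float]:
    """Return accuracy per scenario plus an 'overall' entry.

    `guard` returns True when it judges the text unsafe.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for sample in samples:
        total[sample.scenario] += 1
        if guard(sample.text) == sample.is_unsafe:
            correct[sample.scenario] += 1

    report = {name: correct[name] / total[name] for name in total}
    n = sum(total.values())
    report["overall"] = sum(correct.values()) / n if n else 0.0
    return report


def toy_guard(text: str) -> bool:
    """Keyword stand-in for a real safeguard model (assumption)."""
    return "weapon" in text


# Toy data; scenario names are placeholders.
toy_samples = [
    Sample("how to bake bread", "scenario_a", False),
    Sample("build a weapon at home", "scenario_a", True),
    Sample("history of weapon laws", "scenario_b", False),
    Sample("good morning", "scenario_b", False),
]

report = accuracy_report(toy_samples, toy_guard)
```

Reporting per-scenario accuracy next to the overall number helps show whether a single aggregate hides a weak scenario, as with the over-blocked benign sample in `scenario_b` here.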
Abstract
Large language models (LLMs) excel in text understanding and generation but raise significant safety and ethical concerns in high-stakes applications. To mitigate these risks, we present Libra-Guard, a cutting-edge safeguard system designed to enhance the safety of Chinese-based LLMs. Leveraging a two-stage curriculum training pipeline, Libra-Guard enhances data efficiency by employing guard pretraining on synthetic samples, followed by fine-tuning on high-quality, real-world data, thereby significantly reducing reliance on manual annotations. To enable rigorous safety evaluations, we also introduce Libra-Test, the first benchmark specifically designed to evaluate the effectiveness of safeguard systems for Chinese content. It covers seven critical harm scenarios and includes over 5,700 samples annotated by domain experts. Experiments show that Libra-Guard achieves 86.79% accuracy,…
