SAFETY-J: Evaluating Safety with Critique
Yixiu Liu, Yuxiang Zheng, Shijie Xia, Jiajun Li, Yi Tu, Chaoling Song, Pengfei Liu

TL;DR
SAFETY-J is a bilingual safety evaluator for LLMs that provides detailed critiques of content safety, improving transparency, interpretability, and reliability in safety assessments.
Contribution
It introduces a critique-based safety evaluation framework with automated meta-evaluation and iterative learning, advancing beyond binary safety classification methods.
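To make the contrast with binary classification concrete, here is a minimal sketch of the two output shapes. It is an illustration only, not SAFETY-J's actual interface: the class names, the field names, and the label_accuracy helper are all hypothetical, and the helper checks label agreement only. It does not reproduce the paper's meta-evaluation of critique quality.

```python
from dataclasses import dataclass


@dataclass
class BinaryVerdict:
    """Output shape of a plain binary safety classifier."""
    is_safe: bool


@dataclass
class CritiqueVerdict:
    """Hypothetical output shape of a critique-based evaluator."""
    is_safe: bool
    critique: str  # free-text explanation supporting the label


def label_accuracy(predictions, gold_labels):
    """Fraction of verdicts whose label matches the reference label.

    Assumption: this scores labels only; it says nothing about
    whether the accompanying critique is any good.
    """
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have equal length")
    if not predictions:
        return 0.0
    hits = sum(p.is_safe == g for p, g in zip(predictions, gold_labels))
    return hits / len(predictions)


# Made-up example verdicts, for illustration only.
verdicts = [
    CritiqueVerdict(False, "The response gives step-by-step instructions for a harmful act."),
    CritiqueVerdict(True, "The response declines the request and offers a safe alternative."),
]
accuracy = label_accuracy(verdicts, [False, False])
```

The critique field is what a binary verdict lacks: it gives a reader or a downstream training loop something to inspect when the label is disputed.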
Findings
SAFETY-J offers more nuanced safety judgments than binary safety classification.
It achieves higher critique quality and assessment accuracy.
The system supports scalable, continuous safety evaluation.
Abstract
The deployment of Large Language Models (LLMs) in content generation raises significant safety concerns, particularly regarding the transparency and interpretability of content evaluations. Current methods, primarily focused on binary safety classifications, lack mechanisms for detailed critique, limiting their utility for model improvement and user trust. To address these limitations, we introduce SAFETY-J, a bilingual generative safety evaluator for English and Chinese with critique-based judgment. SAFETY-J utilizes a robust training dataset that includes diverse dialogues and augmented query-response pairs to assess safety across various scenarios comprehensively. We establish an automated meta-evaluation benchmark that objectively assesses the quality of critiques with minimal human intervention, facilitating scalable and continuous improvement. Additionally, SAFETY-J employs an…
Taxonomy
Topics: Occupational Health and Safety Research · Risk and Safety Analysis
