SAFETY-J: Evaluating Safety with Critique

Yixiu Liu; Yuxiang Zheng; Shijie Xia; Jiajun Li; Yi Tu; Chaoling Song; Pengfei Liu

arXiv:2407.17075·cs.CL·August 14, 2024

TL;DR

SAFETY-J is a bilingual safety evaluator for LLMs that provides detailed critiques of content safety, improving transparency, interpretability, and reliability in safety assessments.

Contribution

It introduces a critique-based safety evaluation framework with automated meta-evaluation and iterative learning, advancing beyond binary safety classification methods.

Findings

01. SAFETY-J offers more nuanced safety judgments.

02. It achieves higher critique quality and assessment accuracy.

03. The system supports scalable, continuous safety evaluation.

Abstract

The deployment of Large Language Models (LLMs) in content generation raises significant safety concerns, particularly regarding the transparency and interpretability of content evaluations. Current methods, primarily focused on binary safety classifications, lack mechanisms for detailed critique, limiting their utility for model improvement and user trust. To address these limitations, we introduce SAFETY-J, a bilingual generative safety evaluator for English and Chinese with critique-based judgment. SAFETY-J utilizes a robust training dataset that includes diverse dialogues and augmented query-response pairs to assess safety across various scenarios comprehensively. We establish an automated meta-evaluation benchmark that objectively assesses the quality of critiques with minimal human intervention, facilitating scalable and continuous improvement. Additionally, SAFETY-J employs an…

Code & Models

Repositories

gair-nlp/safety-j (Official)

Taxonomy

Topics: Occupational Health and Safety Research · Risk and Safety Analysis