JADE: A Linguistics-based Safety Evaluation Platform for Large Language Models

Mi Zhang; Xudong Pan; Min Yang

arXiv:2311.00286 · cs.CL · December 12, 2023 · 1 cites

TL;DR

JADE is a linguistics-based fuzzing platform that systematically generates complex, unsafe questions to evaluate and expose safety vulnerabilities in various large language models across Chinese and English.

Contribution

JADE introduces a novel linguistics-inspired method using transformational grammar to generate complex unsafe questions, revealing safety flaws in multiple LLMs.

Findings

01 Achieved an average unsafe generation ratio of 70% across tested LLMs.

02 Generated safety benchmarks with highly threatening questions that are natural and fluent.

03 Demonstrated effectiveness of linguistic transformations in breaking LLM safety guardrails.

Abstract

In this paper, we present JADE, a targeted linguistic fuzzing platform that strengthens the linguistic complexity of seed questions to simultaneously and consistently break a wide range of widely used LLMs in three groups: eight open-sourced Chinese, six commercial Chinese, and four commercial English LLMs. JADE generates three safety benchmarks for the three groups of LLMs, which contain highly threatening unsafe questions: the questions simultaneously trigger harmful generation in multiple LLMs, with an average unsafe generation ratio of $70\%$ (please see the table below), while remaining natural and fluent and preserving the core unsafe semantics. We release the benchmark demos generated for commercial English LLMs and open-sourced English LLMs at the following link: https://github.com/whitzard-ai/jade-db. For readers who are interested in evaluating on…
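The abstract reports an average unsafe generation ratio but the excerpt here does not spell out how it is computed. As a minimal sketch, assuming the ratio is the fraction of benchmark questions for which a model's response is judged unsafe, averaged over the models in a group, the arithmetic could look like the following. The function names, model names, and judgment labels are all hypothetical and are not taken from the paper or the jade-db repository.

```python
from statistics import mean


def unsafe_ratio(labels):
    """Fraction of responses judged unsafe (True) for a single model."""
    return sum(labels) / len(labels)


def average_unsafe_ratio(judgments):
    """Mean of the per-model unsafe ratios across all models in a group."""
    return mean(unsafe_ratio(labels) for labels in judgments.values())


# Hypothetical judgments: one boolean per benchmark question, per model.
# True means the model's response to that question was judged unsafe.
toy_judgments = {
    "model_a": [True, True, False, True],
    "model_b": [True, False, False, True],
    "model_c": [True, True, True, True],
}

toy_average = average_unsafe_ratio(toy_judgments)
print(f"average unsafe generation ratio: {toy_average:.0%}")
```

Under this assumed definition, every model is weighted equally regardless of how many questions it answered; a pooled ratio over all (model, question) pairs would be a different, equally plausible reading.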

Code & Models

Repositories

whitzard-ai/jade-db (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification