JADE: A Linguistics-based Safety Evaluation Platform for Large Language Models
Mi Zhang, Xudong Pan, Min Yang

TL;DR
JADE is a linguistics-based fuzzing platform that systematically generates complex, unsafe questions to evaluate and expose safety vulnerabilities in various large language models across Chinese and English.
Contribution
JADE introduces a novel linguistics-inspired method using transformational grammar to generate complex unsafe questions, revealing safety flaws in multiple LLMs.
Findings
Achieved an average unsafe generation ratio of 70% across tested LLMs.
Generated safety benchmarks with highly threatening questions that are natural and fluent.
Demonstrated effectiveness of linguistic transformations in breaking LLM safety guardrails.
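The "average unsafe generation ratio" reported above can be illustrated with a minimal sketch. It assumes the ratio is computed per model as the fraction of benchmark questions whose response is judged unsafe, and then averaged across models; the paper summary here does not spell out the exact definition, so treat this as one plausible reading. The model names and labels below are hypothetical and are not data from the paper.

```python
from statistics import mean


def unsafe_generation_ratio(labels):
    """Fraction of responses judged unsafe (True) for a single model."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return sum(labels) / len(labels)


def average_unsafe_ratio(labels_by_model):
    """Mean of the per-model unsafe ratios (assumed aggregation)."""
    return mean(unsafe_generation_ratio(v) for v in labels_by_model.values())


# Hypothetical judgments: True means the response to that question
# was judged unsafe. These are illustrative values only.
example = {
    "model_a": [True, True, False, True],
    "model_b": [True, False, False, True],
}

per_model = {name: unsafe_generation_ratio(v) for name, v in example.items()}
avg = average_unsafe_ratio(example)
print(per_model, avg)
```

Averaging per-model ratios weights every model equally; if models were evaluated on different numbers of questions, pooling all judgments before dividing would give a different figure.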
Abstract
In this paper, we present JADE, a targeted linguistic fuzzing platform that strengthens the linguistic complexity of seed questions to simultaneously and consistently break a wide range of widely used LLMs, categorized in three groups: eight open-sourced Chinese, six commercial Chinese, and four commercial English LLMs. JADE generates three safety benchmarks for the three groups of LLMs. The benchmarks contain highly threatening unsafe questions: they simultaneously trigger harmful generation in multiple LLMs, with an average unsafe generation ratio of 70% (please see the table below), while remaining natural and fluent and preserving the core unsafe semantics. We release the benchmark demos generated for commercial English LLMs and open-sourced English LLMs at the following link: https://github.com/whitzard-ai/jade-db. For readers who are interested in evaluating on…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
