Safe-Embed: Unveiling the Safety-Critical Knowledge of Sentence Encoders

Jinseok Kim; Jaewon Jung; Sangyeop Kim; Sohyung Park; Sungzoon Cho

arXiv:2407.06851·cs.CL·July 10, 2024

Safe-Embed: Unveiling the Safety-Critical Knowledge of Sentence Encoders

Jinseok Kim, Jaewon Jung, Sangyeop Kim, Sohyung Park, Sungzoon Cho

PDF

Open Access 1 Repo

TL;DR

This paper explores how sentence encoders can identify and classify unsafe prompts to improve the safety of large language models, introducing new datasets and metrics for evaluation.

Contribution

It introduces new datasets and the Categorical Purity metric to evaluate sentence encoders' ability to detect and classify unsafe prompts, highlighting their effectiveness and limitations.

Findings

01

Sentence encoders can distinguish safe from unsafe prompts.

02

Existing encoders have limitations in classifying all unsafe prompt types.

03

New datasets and metrics facilitate better evaluation of safety detection capabilities.

Abstract

Despite the impressive capabilities of Large Language Models (LLMs) in various tasks, their vulnerability to unsafe prompts remains a critical issue. These prompts can lead LLMs to generate responses on illegal or sensitive topics, posing a significant threat to their safe and ethical use. Existing approaches attempt to address this issue using classification models, but they have several drawbacks. With the increasing complexity of unsafe prompts, similarity search-based techniques that identify specific features of unsafe prompts provide a more robust and effective solution to this evolving problem. This paper investigates the potential of sentence encoders to distinguish safe from unsafe prompts, and the ability to classify various unsafe prompts according to a safety taxonomy. We introduce new pairwise datasets and the Categorical Purity (CP) metric to measure this capability. Our…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

jwdanieljung/safe-embed
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection