On Prompt-Driven Safeguarding for Large Language Models

Chujie Zheng; Fan Yin; Hao Zhou; Fandong Meng; Jie Zhou; Kai-Wei Chang; Minlie Huang; Nanyun Peng

arXiv:2401.18018 · cs.LG · June 4, 2024 · 5 cites



TL;DR

This paper investigates how safety prompts influence large language models' behavior by analyzing their internal representations, and introduces DRO, a method to optimize safety prompts for better safeguarding without harming model performance.

Contribution

The study reveals the representation dynamics of safety prompts and proposes DRO, a novel method to automatically optimize safety prompts based on model representations.

Findings

01

DRO improves safety prompt effectiveness across multiple LLMs.

02

Safety prompts move queries toward a refusal direction in representation space.

03

Models can distinguish harmful from harmless queries without safety prompts.
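The second finding can be illustrated with a small numerical sketch. It is not the paper's procedure: the vectors below are synthetic stand-ins for model hidden states, the difference-of-means estimate of a "refusal direction" is an assumption chosen for simplicity, and the helper names `refusal_direction` and `mean_shift_along` are hypothetical. The sketch only shows how one could check whether prepending a safety prompt moves query representations along such a direction.

```python
import numpy as np

def refusal_direction(refused, complied):
    """Unit vector pointing from the mean complied representation to the mean refused one.

    Assumption: a difference of means is used here purely for illustration.
    """
    d = refused.mean(axis=0) - complied.mean(axis=0)
    return d / np.linalg.norm(d)

def mean_shift_along(direction, without_prompt, with_prompt):
    """Average displacement of paired representations, projected onto `direction`."""
    return float(((with_prompt - without_prompt) @ direction).mean())

# Synthetic stand-ins for hidden states (not real model activations).
rng = np.random.default_rng(0)
dim = 16
true_dir = np.zeros(dim)
true_dir[0] = 1.0

complied = rng.normal(size=(200, dim))
refused = rng.normal(size=(200, dim)) + 3.0 * true_dir
direction = refusal_direction(refused, complied)

# The same queries without and with a (simulated) safety prompt.
queries = rng.normal(size=(100, dim))
prompted = queries + 1.5 * true_dir + 0.1 * rng.normal(size=(100, dim))

shift = mean_shift_along(direction, queries, prompted)
print(f"mean shift along the estimated refusal direction: {shift:.2f}")
```

A positive shift means the prompted representations sit further along the estimated direction than the unprompted ones, which is the pattern the finding describes. With real models, the inputs would be hidden states extracted for the same queries with and without the safety prompt.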

Abstract

Prepending model inputs with safety prompts is a common practice for safeguarding large language models (LLMs) against queries with harmful intents. However, the underlying working mechanisms of safety prompts have not been unraveled yet, restricting the possibility of automatically optimizing them to improve LLM safety. In this work, we investigate how LLMs' behavior (i.e., complying with or refusing user queries) is affected by safety prompts from the perspective of model representation. We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction, in which models become more prone to refusing to provide assistance, even when the queries are harmless. On the other hand, LLMs are naturally capable of distinguishing harmful and harmless queries without safety prompts. Inspired by these findings, we propose a method for…
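The claim that models can separate harmful from harmless queries without a safety prompt is the kind of statement a linear probe can test. The following is a minimal sketch under stated assumptions: the data are synthetic Gaussian clusters rather than real hidden states, the probe is a plain logistic regression trained by gradient descent, and `fit_linear_probe` and `probe_accuracy` are hypothetical helper names, not part of the paper's code.

```python
import numpy as np

def fit_linear_probe(x, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with batch gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted probability of "harmful"
        grad = p - y                             # gradient of the log loss w.r.t. the logit
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(w, b, x, y):
    """Fraction of examples whose predicted label matches `y`."""
    return float((((x @ w + b) > 0).astype(int) == y).mean())

# Synthetic stand-ins for representations of queries WITHOUT a safety prompt.
rng = np.random.default_rng(1)
dim = 16
offset = np.zeros(dim)
offset[1] = 2.0
harmless = rng.normal(size=(300, dim)) - offset
harmful = rng.normal(size=(300, dim)) + offset

x = np.vstack([harmless, harmful])
y = np.concatenate([np.zeros(300), np.ones(300)])

# Even-indexed rows train the probe; odd-indexed rows are held out.
w, b = fit_linear_probe(x[::2], y[::2])
accuracy = probe_accuracy(w, b, x[1::2], y[1::2])
print(f"held-out probe accuracy: {accuracy:.2f}")
```

High held-out accuracy on real hidden states would indicate that harmfulness is linearly recoverable from the representation before any safety prompt is added. On this synthetic data it only confirms that the probe works as intended.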



Taxonomy

Topics: VLSI and Analog Circuit Testing · Digital Rights Management and Security · Advancements in Photolithography Techniques