Mitigating Exaggerated Safety in Large Language Models

Ruchira Ray; Ruchi Bhalani

arXiv:2405.05418·cs.CL·August 30, 2024·2 cites

Mitigating Exaggerated Safety in Large Language Models

Ruchira Ray, Ruchi Bhalani

PDF

Open Access

TL;DR

This paper investigates exaggerated safety behaviors in large language models, proposing multiple prompting strategies to significantly reduce false safety refusals while maintaining helpfulness.

Contribution

It introduces a combination of prompting techniques to mitigate exaggerated safety in LLMs, achieving a 92.9% reduction in safety misclassifications.

Findings

01

Few-shot prompting is most effective for Llama2.

02

Interactive prompting works best for Gemma.

03

Contextual prompting is optimal for Command R+ and Phi-3.

Abstract

As the popularity of Large Language Models (LLMs) grow, combining model safety with utility becomes increasingly important. The challenge is making sure that LLMs can recognize and decline dangerous prompts without sacrificing their ability to be helpful. The problem of "exaggerated safety" demonstrates how difficult this can be. To reduce excessive safety behaviours -- which was discovered to be 26.1% of safe prompts being misclassified as dangerous and refused -- we use a combination of XSTest dataset prompts as well as interactive, contextual, and few-shot prompting to examine the decision bounds of LLMs such as Llama2, Gemma Command R+, and Phi-3. We find that few-shot prompting works best for Llama2, interactive prompting works best Gemma, and contextual prompting works best for Command R+ and Phi-3. Using a combination of these prompting strategies, we are able to mitigate…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning