Mitigating Exaggerated Safety in Large Language Models
Ruchira Ray, Ruchi Bhalani

TL;DR
This paper investigates exaggerated safety behaviors in large language models, proposing multiple prompting strategies to significantly reduce false safety refusals while maintaining helpfulness.
Contribution
It introduces a combination of prompting techniques to mitigate exaggerated safety in LLMs, achieving a 92.9% reduction in safety misclassifications.
Findings
Few-shot prompting is most effective for Llama2.
Interactive prompting works best for Gemma.
Contextual prompting is optimal for Command R+ and Phi-3.
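The findings above compare prompting strategies by how often a model refuses prompts that are actually safe. As a minimal sketch of that kind of measurement, the Python below builds a few-shot prompt and computes a false-refusal rate over a set of safe prompts. Everything in it is an assumption made for illustration, not the authors' code: the function names, the `Q:`/`A:` template, and the prefix-based refusal markers are hypothetical, and `generate` stands in for whatever model call is used.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical refusal markers (assumption): the paper's actual criteria for
# deciding that a response is a refusal are not stated in this summary.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am sorry")


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Prepend (question, answer) demonstrations to the query (assumed template)."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)


def is_refusal(response: str) -> bool:
    """Crude check: does the response open with one of the assumed markers?"""
    text = response.strip().lower()
    return any(text.startswith(marker) for marker in REFUSAL_MARKERS)


def false_refusal_rate(
    safe_prompts: Iterable[str], generate: Callable[[str], str]
) -> float:
    """Fraction of safe prompts whose response is classified as a refusal."""
    prompts = list(safe_prompts)
    if not prompts:
        return 0.0
    refused = sum(is_refusal(generate(p)) for p in prompts)
    return refused / len(prompts)
```

To compare strategies, one would call `false_refusal_rate` once with a `generate` that sends the raw prompt and once with a `generate` that first wraps it via `build_few_shot_prompt`, then compare the two rates per model. A prefix match is only a rough proxy for refusal detection and would need manual checking in practice.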
Abstract
As the popularity of Large Language Models (LLMs) grows, combining model safety with utility becomes increasingly important. The challenge is making sure that LLMs can recognize and decline dangerous prompts without sacrificing their ability to be helpful. The problem of "exaggerated safety" demonstrates how difficult this can be. To reduce excessive safety behaviours, in which 26.1% of safe prompts were found to be misclassified as dangerous and refused, we use a combination of XSTest dataset prompts as well as interactive, contextual, and few-shot prompting to examine the decision bounds of LLMs such as Llama2, Gemma, Command R+, and Phi-3. We find that few-shot prompting works best for Llama2, interactive prompting works best for Gemma, and contextual prompting works best for Command R+ and Phi-3. Using a combination of these prompting strategies, we are able to mitigate…
Taxonomy
Topics: Adversarial Robustness in Machine Learning
