TL;DR
This paper introduces AdaCD, a training-free, model-agnostic method that adaptively adjusts refusal behavior in large language models to better balance safety and usability.
Contribution
It proposes a novel adaptive contrastive decoding approach that mitigates over-refusal in LLMs without retraining, improving safety and response appropriateness.
Findings
Reduces refusal ratio for over-refusal queries by 10.35% on average.
Increases refusal ratio for malicious queries by 0.13%.
Works across five benchmark datasets.
Abstract
Safety-aligned large language models (LLMs) often generate refusal responses to harmless queries, a failure known as the over-refusal problem. However, existing methods for mitigating over-refusal cannot maintain a low refusal ratio for harmless queries while keeping a high refusal ratio for malicious ones. In this paper, we analyze how system prompts with varying safety levels affect LLM refusal behaviors when facing over-refusal queries. A key observation is that, when LLMs suffer from the over-refusal issue, non-refusal tokens remain present in the next-token candidate list, but the model systematically fails to select them and generates refusal tokens instead. Based on this observation, we propose a training-free and model-agnostic approach, Adaptive Contrastive Decoding (AdaCD), to mitigate over-refusal while maintaining LLM safety. First, AdaCD compares the output distributions of the LLM…
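The abstract is cut off before it finishes describing AdaCD, so the snippet below is not the paper's algorithm. It is a minimal sketch of generic contrastive decoding over next-token logits, meant only to illustrate the observation that a non-refusal token can still sit in the candidate list while greedy decoding picks a refusal token. The function name `contrastive_adjust`, the weight `alpha`, the three-token vocabulary, the logit values, and the choice of a stricter system prompt as the contrast run are all assumptions made for illustration.

```python
def contrastive_adjust(logits_base, logits_contrast, alpha=1.0):
    """Generic contrastive step (hypothetical helper, not the AdaCD formula).

    Boosts tokens that the base run prefers more than the contrast run does,
    and penalizes tokens that the contrast run prefers more.
    """
    return [b + alpha * (b - c) for b, c in zip(logits_base, logits_contrast)]


def argmax_token(vocab, logits):
    """Return the token with the highest logit (greedy selection)."""
    best = max(range(len(vocab)), key=lambda i: logits[i])
    return vocab[best]


# Toy next-token candidates for a harmless query that triggers over-refusal.
vocab = ["Sorry", "Sure", "The"]

# Assumed logits under the default system prompt: the refusal token wins,
# but the non-refusal token "Sure" is still a close candidate.
logits_default = [2.0, 1.5, 0.5]

# Assumed logits under a much stricter, safety-focused system prompt,
# which pushes the refusal token up even further.
logits_strict = [4.0, 0.5, 0.0]

greedy_token = argmax_token(vocab, logits_default)
adjusted = contrastive_adjust(logits_default, logits_strict, alpha=1.0)
contrastive_token = argmax_token(vocab, adjusted)

print(greedy_token, contrastive_token)
```

With these toy numbers, plain greedy decoding selects the refusal token, while the contrastive adjustment promotes the non-refusal candidate that was already in the list. How AdaCD chooses its comparison distributions and adapts the adjustment per query is not recoverable from the truncated abstract.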
