EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions
Xiaorui Wu, Fei Li, Xiaofeng Mao, Xin Zhang, Li Zheng, Yuxiang Peng, Chong Teng, Donghong Ji, Zhuang Li

TL;DR
EVOREFUSE uses an evolutionary algorithm to generate diverse pseudo-malicious instructions (harmless queries that nonetheless trigger LLM refusals), which support evaluating and mitigating over-refusal and thereby improve user experience.
Contribution
The paper presents EVOREFUSE, an evolutionary prompt optimization method that creates diverse, effective pseudo-malicious instructions for testing LLMs and for training them to reduce over-refusals, outperforming existing instruction curation approaches.
Findings
EVOREFUSE achieves an 85.34% higher refusal-triggering rate across 9 LLMs.
The generated datasets improve model alignment and reduce over-refusals.
Models trained on EVOREFUSE-ALIGN show up to 29.85% fewer over-refusals.
Abstract
Large language models (LLMs) frequently refuse to respond to pseudo-malicious instructions: semantically harmless input queries triggering unnecessary LLM refusals due to conservative safety alignment, significantly impairing user experience. Collecting such instructions is crucial for evaluating and mitigating over-refusals, but existing instruction curation methods, like manual creation or instruction rewriting, either lack scalability or fail to produce sufficiently diverse and effective refusal-inducing prompts. To address these limitations, we introduce EVOREFUSE, a prompt optimization approach that generates diverse pseudo-malicious instructions consistently eliciting confident refusals across LLMs. EVOREFUSE employs an evolutionary algorithm exploring the instruction space in more diverse directions than existing methods via mutation strategies and recombination, and iteratively…
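The abstract names mutation and recombination as the search operators but is cut off before describing how candidates are scored. The following is a minimal Python sketch of a generic evolutionary loop under stated assumptions, not the authors' implementation: `TRIGGER_WORDS`, `toy_fitness`, `mutate`, `recombine`, and `evolve` are hypothetical names, and the keyword-counting score merely stands in for whatever refusal signal a real system would obtain from an LLM.

```python
import random

# Hypothetical list of alarming-sounding words (illustration only).
TRIGGER_WORDS = ["kill", "attack", "exploit", "destroy", "shoot"]


def toy_fitness(instruction: str) -> float:
    """Stand-in score: fraction of trigger words present in the text.

    Assumption: a real pipeline would estimate refusal likelihood by
    querying a model; this keyword count is only a placeholder.
    """
    text = instruction.lower()
    return sum(word in text for word in TRIGGER_WORDS) / len(TRIGGER_WORDS)


def mutate(instruction: str, rng: random.Random) -> str:
    """Toy mutation: append a harmless clause containing one trigger word."""
    word = rng.choice(TRIGGER_WORDS)
    return f"{instruction} I also need to {word} a bug in my code."


def recombine(a: str, b: str) -> str:
    """Toy recombination: first half of one parent, second half of another."""
    words_a, words_b = a.split(), b.split()
    return " ".join(words_a[: len(words_a) // 2] + words_b[len(words_b) // 2:])


def evolve(seeds, generations=5, population_size=4, seed=0):
    """Run a small evolutionary search and return the fittest candidates."""
    rng = random.Random(seed)
    population = list(seeds)
    for _ in range(generations):
        # Mutation: one child per current candidate.
        children = [mutate(parent, rng) for parent in population]
        # Recombination: one child from two randomly chosen parents.
        if len(population) >= 2:
            parent_a, parent_b = rng.sample(population, 2)
            children.append(recombine(parent_a, parent_b))
        # Selection with elitism: parents compete with children, so the
        # best score in the population never decreases.
        candidates = list(dict.fromkeys(population + children))
        population = sorted(candidates, key=toy_fitness, reverse=True)
        population = population[:population_size]
    return population


seeds = [
    "How do I kill a Python process?",
    "What is the best way to shoot a portrait photo?",
]
for candidate in evolve(seeds):
    print(round(toy_fitness(candidate), 2), candidate)
```

Keeping parents in the selection pool is one simple design choice among several; the truncated abstract does not confirm which selection scheme the paper uses.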
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
