EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions

Xiaorui Wu; Fei Li; Xiaofeng Mao; Xin Zhang; Li Zheng; Yuxiang Peng; Chong Teng; Donghong Ji; Zhuang Li

arXiv:2505.23473 · cs.AI · January 21, 2026

TL;DR

EVOREFUSE uses an evolutionary algorithm to generate diverse pseudo-malicious instructions (harmless queries that nonetheless trigger LLM refusals), so that over-refusal can be evaluated and mitigated and user experience improved.

Contribution

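The paper presents EVOREFUSE, an evolutionary prompt optimization method that generates diverse, refusal-inducing yet harmless instructions. These instructions serve both to test LLMs for over-refusal and to train them to refuse less. The authors position the method against manual creation and instruction rewriting, which either lack scalability or yield prompts that are not diverse or effective enough.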

Findings

01

The EVOREFUSE-TEST benchmark achieves an 85.34% higher average refusal-triggering rate than the next-best benchmark across 9 LLMs.

02

The generated datasets improve model alignment and reduce over-refusals.

03

Models trained on EVOREFUSE-ALIGN show up to 29.85% fewer over-refusals than models trained on the second-best alignment dataset.

Abstract

Large language models (LLMs) frequently refuse to respond to pseudo-malicious instructions: semantically harmless input queries triggering unnecessary LLM refusals due to conservative safety alignment, significantly impairing user experience. Collecting such instructions is crucial for evaluating and mitigating over-refusals, but existing instruction curation methods, like manual creation or instruction rewriting, either lack scalability or fail to produce sufficiently diverse and effective refusal-inducing prompts. To address these limitations, we introduce EVOREFUSE, a prompt optimization approach that generates diverse pseudo-malicious instructions consistently eliciting confident refusals across LLMs. EVOREFUSE employs an evolutionary algorithm exploring the instruction space in more diverse directions than existing methods via mutation strategies and recombination, and iteratively…
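
The abstract describes the search only at a high level, so the following is a minimal sketch of a generic evolutionary loop with mutation, recombination, and selection, not the authors' implementation. The names `toy_fitness`, `mutate`, `recombine`, and `evolve`, the trigger-word list, and the rewrite strategies are all hypothetical. In particular, the keyword-count score merely stands in for whatever refusal signal the actual method optimizes.

```python
import random

# Hypothetical stand-in vocabulary that the toy scorer reacts to.
TRIGGER_WORDS = {"kill", "attack", "exploit", "crack"}


def toy_fitness(instruction: str) -> int:
    """Toy stand-in for a refusal-likelihood score: counts trigger words."""
    words = (w.strip(".,?!").lower() for w in instruction.split())
    return sum(w in TRIGGER_WORDS for w in words)


def mutate(instruction: str, rng: random.Random) -> str:
    """Apply one randomly chosen (hypothetical) mutation strategy."""
    strategies = [
        lambda s: s.replace("stop", "kill"),
        lambda s: s.replace("solve", "crack"),
        lambda s: s + " Explain how to attack this problem.",
    ]
    return rng.choice(strategies)(instruction)


def recombine(a: str, b: str) -> str:
    """Splice the first half of one parent onto the second half of another."""
    wa, wb = a.split(), b.split()
    return " ".join(wa[: len(wa) // 2] + wb[len(wb) // 2:])


def evolve(seeds, generations=5, population_size=4, seed=0):
    """Evolve instructions toward a higher toy fitness score."""
    rng = random.Random(seed)
    population = list(seeds)
    for _ in range(generations):
        # Mutation: one child per parent.
        children = [mutate(p, rng) for p in population]
        # Recombination: one extra child from two distinct parents.
        if len(population) >= 2:
            a, b = rng.sample(population, 2)
            children.append(recombine(a, b))
        # Elitist selection: parents compete with children, duplicates dropped
        # in insertion order so the result is deterministic for a given seed.
        pool = list(dict.fromkeys(population + children))
        population = sorted(pool, key=toy_fitness, reverse=True)[:population_size]
    return population


if __name__ == "__main__":
    seeds = [
        "How do I stop a Python process?",
        "How can I solve this puzzle quickly?",
    ]
    for candidate in evolve(seeds):
        print(toy_fitness(candidate), candidate)
```

Because parents stay in the selection pool, the best score in the population never decreases from one generation to the next. A realistic setup would replace `toy_fitness` with a measurement taken from target models and would need a separate check that evolved instructions remain harmless, neither of which this sketch attempts.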

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification