EASE: Practical and Efficient Safety Alignment for Small Language Models
Haonan Shi, Guoli Wang, Tu Ouyang, An Wang

TL;DR
EASE is a framework that improves the safety of small language models by applying safety reasoning selectively, reducing jailbreak success rates by up to 17% and inference overhead by up to 90% for edge deployment.
Contribution
The paper introduces EASE, a novel method for efficient safety alignment in small language models through selective safety reasoning and effective knowledge distillation.
Findings
Reduces jailbreak success rates by up to 17%.
Decreases inference overhead by up to 90%.
Balances safety and efficiency in resource-constrained environments.
Abstract
Small language models (SLMs) are increasingly deployed on edge devices, making their safety alignment crucial yet challenging. Current shallow alignment methods that rely on direct refusal of malicious queries fail to provide robust protection, particularly against adversarial jailbreaks. While deliberative safety reasoning alignment offers deeper alignment for defending against sophisticated attacks, effectively implanting such reasoning capability in SLMs with limited capabilities remains an open challenge. Moreover, safety reasoning incurs significant computational overhead as models apply reasoning to nearly all queries, making it impractical for resource-constrained edge deployment scenarios that demand rapid responses. We propose EASE, a novel framework that enables practical and Efficient safety Alignment for Small languagE models. Our approach first identifies the optimal safety…
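The idea of applying safety reasoning only to the queries that need it can be illustrated with a toy router. This is a minimal sketch under stated assumptions, not the method from the paper: the keyword gate `needs_safety_reasoning`, the cue list `SUSPICIOUS_CUES`, and the cost constant `REASONING_TOKENS` are all hypothetical names and values chosen for illustration, and a real system would presumably use a learned decision rather than keyword matching.

```python
from dataclasses import dataclass

# Hypothetical cue list (assumption); a real gate would be learned, not keyword-based.
SUSPICIOUS_CUES = ("ignore previous instructions", "pretend you are", "bypass", "jailbreak")

# Assumed, illustrative cost in generated tokens of one safety reasoning trace.
REASONING_TOKENS = 200


@dataclass
class Route:
    query: str
    use_reasoning: bool
    extra_tokens: int


def needs_safety_reasoning(query: str) -> bool:
    """Toy gate: flag a query when it contains any suspicious cue."""
    lowered = query.lower()
    return any(cue in lowered for cue in SUSPICIOUS_CUES)


def route(query: str) -> Route:
    """Send flagged queries down the reasoning path, all others straight to an answer."""
    flagged = needs_safety_reasoning(query)
    return Route(query, flagged, REASONING_TOKENS if flagged else 0)


def overhead_reduction(queries: list[str]) -> float:
    """Fraction of reasoning tokens saved compared with reasoning on every query."""
    always = REASONING_TOKENS * len(queries)
    selective = sum(route(q).extra_tokens for q in queries)
    return 1.0 - selective / always
```

In this sketch the saving depends entirely on how many queries the gate flags: if one query in four is routed to the reasoning path, three quarters of the reasoning tokens are avoided. The figures reported for EASE come from the paper's own evaluation and are not reproduced by this toy.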
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Graph Neural Networks · Big Data and Digital Economy
