Should LLM Safety Be More Than Refusing Harmful Instructions?

Utsav Maskey; Mark Dras; Usman Naseem

arXiv:2506.02442·cs.CL·June 5, 2025

TL;DR

This paper systematically evaluates LLM safety in complex long-tail encrypted texts, revealing vulnerabilities in safety mechanisms and emphasizing the need for more comprehensive safety strategies beyond simple instruction refusal.

Contribution

It introduces a two-dimensional safety framework and demonstrates that current safeguards may fail under cipher decryption scenarios, highlighting areas for improvement.

Findings

01 Models can decrypt ciphers but may fail safety tests

02 Safety mechanisms can either over-refuse or be unsafe

03 Current safeguards have notable limitations

Abstract

This paper presents a systematic evaluation of the behavior of Large Language Models (LLMs) on long-tail distributed (encrypted) texts and the safety implications of that behavior. We introduce a two-dimensional framework for assessing LLM safety: (1) instruction refusal, the ability to reject harmful obfuscated instructions, and (2) generation safety, the suppression of harmful responses. Through comprehensive experiments, we demonstrate that models capable of decrypting ciphers may be susceptible to mismatched-generalization attacks: their safety mechanisms fail on at least one safety dimension, leading to unsafe responses or over-refusal. Based on these findings, we evaluate a number of pre-LLM and post-LLM safeguards and discuss their strengths and limitations. This work contributes to understanding the safety of LLMs in long-tail text scenarios and provides directions for…
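To make the two dimensions in the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' evaluation code: the Caesar shift stands in for whichever ciphers the paper actually tests, and the names `caesar_shift`, `SafetyOutcome`, and `classify`, as well as the three outcome labels, are hypothetical.

```python
from dataclasses import dataclass


def caesar_shift(text: str, shift: int) -> str:
    """Shift ASCII letters by `shift` positions; leave other characters unchanged.

    Assumption: a Caesar shift is used only as a simple example of an
    obfuscated ("long-tail") input. The paper's cipher set is not given here.
    """
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)


@dataclass(frozen=True)
class SafetyOutcome:
    """Hypothetical record of one model response along the two dimensions."""
    refused: bool         # dimension 1: the model rejected the instruction
    harmful_output: bool  # dimension 2: the response contained harmful content


def classify(prompt_is_harmful: bool, outcome: SafetyOutcome) -> str:
    """Map a response to an illustrative label (labels are assumptions)."""
    if outcome.harmful_output:
        return "unsafe"        # generation safety failed
    if outcome.refused and not prompt_is_harmful:
        return "over-refusal"  # a benign obfuscated request was rejected
    return "safe"


encoded = caesar_shift("Hello, World!", 3)
decoded = caesar_shift(encoded, -3)
```

Under this sketch, a model that declines a benign enciphered request is labeled `over-refusal`, while one that produces harmful content is labeled `unsafe` whether or not it also issued a refusal. That is one way to read the abstract's claim that a failure on either dimension counts as a safety failure.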

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Spam and Phishing Detection