Should LLM Safety Be More Than Refusing Harmful Instructions?
Utsav Maskey, Mark Dras, Usman Naseem

TL;DR
This paper systematically evaluates LLM safety in complex long-tail encrypted texts, revealing vulnerabilities in safety mechanisms and emphasizing the need for more comprehensive safety strategies beyond simple instruction refusal.
Contribution
It introduces a two-dimensional safety framework and demonstrates that current safeguards may fail under cipher decryption scenarios, highlighting areas for improvement.
Findings
Models can decrypt ciphers but may fail safety tests
Safety mechanisms can either over-refuse or be unsafe
Current safeguards have notable limitations
Abstract
This paper presents a systematic evaluation of Large Language Models' (LLMs) behavior on long-tail distributed (encrypted) texts and their safety implications. We introduce a two-dimensional framework for assessing LLM safety: (1) instruction refusal-the ability to reject harmful obfuscated instructions, and (2) generation safety-the suppression of generating harmful responses. Through comprehensive experiments, we demonstrate that models that possess capabilities to decrypt ciphers may be susceptible to mismatched-generalization attacks: their safety mechanisms fail on at least one safety dimension, leading to unsafe responses or over-refusal. Based on these findings, we evaluate a number of pre-LLM and post-LLM safeguards and discuss their strengths and limitations. This work contributes to understanding the safety of LLM in long-tail text scenarios and provides directions for…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Spam and Phishing Detection
