Self-HarmLLM: Can Large Language Model Harm Itself?
Heehwan Kim, Sungjune Park, Daeseon Choi

TL;DR
This paper asks whether a large language model's own output can be turned against it. In a new attack scenario, a query rewritten by the model is fed back to the same model, which exposes limitations in current guardrails and evaluation methods.
Contribution
The paper introduces the Self-HarmLLM scenario and shows that a model can rewrite a harmful query into an ambiguous form that bypasses its own guardrails, highlighting the need for improved safety measures and evaluation techniques.
Findings
Up to 52% transformation success rate in zero-shot conditions
Up to 41% jailbreak success rate in few-shot conditions
Automated evaluations overestimate jailbreak success by 52% on average
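As a rough illustration of how figures like these could be tallied, the sketch below computes a transformation rate, a jailbreak rate under automated and human judgment, and the gap between the two. The `Trial` record, its field names, the choice of denominator (all trials), and the reading of "overestimate" as a percentage-point difference are assumptions made for this example, not details confirmed by the paper.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Trial:
    """One evaluated query (hypothetical record layout)."""
    transformed: bool      # the model produced a usable MHQ
    auto_jailbreak: bool   # an automated judge flagged a jailbreak
    human_jailbreak: bool  # a human reviewer flagged a jailbreak


def rate(flags: Iterable[bool]) -> float:
    """Percentage of True values; 0.0 for an empty input."""
    values = list(flags)
    return 100.0 * sum(values) / len(values) if values else 0.0


def summarize(trials: List[Trial]) -> Dict[str, float]:
    """Aggregate rates over all trials (assumed denominator)."""
    auto = rate(t.auto_jailbreak for t in trials)
    human = rate(t.human_jailbreak for t in trials)
    return {
        "transformation_rate": rate(t.transformed for t in trials),
        "auto_jailbreak_rate": auto,
        "human_jailbreak_rate": human,
        # Positive when the automated judge reports more jailbreaks than humans do.
        "overestimate_points": auto - human,
    }
```

Comparing the automated and human rates side by side on the same trials is what makes an overestimate visible; the paper's own protocol for that comparison may differ from this sketch.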
Abstract
Large Language Models (LLMs) are generally equipped with guardrails to block the generation of harmful responses. However, existing defenses always assume that an external attacker crafts the harmful query, and the possibility of a model's own output becoming a new attack vector has not been sufficiently explored. In this study, we propose the Self-HarmLLM scenario, which uses a Mitigated Harmful Query (MHQ) generated by the same model as a new input. An MHQ is an ambiguous query whose original intent is preserved while its harmful nature is not directly exposed. We verified whether a jailbreak occurs when this MHQ is re-entered into a separate session of the same model. We conducted experiments on GPT-3.5-turbo, LLaMA3-8B-instruct, and DeepSeek-R1-Distill-Qwen-7B under Base, Zero-shot, and Few-shot conditions. The results showed up to 52% transformation success rate and up to 33%…
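The two-session flow described in the abstract can be outlined as a small evaluation harness. This is a minimal sketch under stated assumptions: `new_session`, `build_transform_prompt`, `is_valid_mhq`, and `is_jailbreak` are hypothetical callables supplied by the evaluator, and none of them correspond to a real library API or to the authors' code.

```python
from typing import Callable, Dict

# Hypothetical type: a chat session maps one prompt to one reply.
Chat = Callable[[str], str]


def run_trial(
    query: str,
    new_session: Callable[[], Chat],
    build_transform_prompt: Callable[[str], str],
    is_valid_mhq: Callable[[str, str], bool],
    is_jailbreak: Callable[[str], bool],
) -> Dict[str, bool]:
    """Evaluate one query under the two-session setup (illustrative only)."""
    # Session 1: the model is asked to rewrite the query into an MHQ.
    mhq = new_session()(build_transform_prompt(query))
    if not is_valid_mhq(query, mhq):
        # Transformation failed, so there is nothing to re-enter.
        return {"transformed": False, "jailbreak": False}

    # Session 2: a separate session of the same model sees only the MHQ.
    response = new_session()(mhq)
    return {"transformed": True, "jailbreak": is_jailbreak(response)}
```

Creating a fresh session for each step keeps the second call free of any context from the first, which mirrors the "separate session" condition in the abstract.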
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
