Self-HarmLLM: Can Large Language Model Harm Itself?

Heehwan Kim; Sungjune Park; Daeseon Choi

arXiv:2511.08597·cs.CL·November 13, 2025

TL;DR

This paper examines whether a large language model's own output can be used to jailbreak that same model, introducing a novel attack scenario that reveals limitations in current guardrails and evaluation methods.

Contribution

It introduces the Self-HarmLLM scenario, demonstrating that a model can rewrite a harmful query into an ambiguous form (a Mitigated Harmful Query, MHQ) that bypasses its own guardrails when re-entered in a separate session, highlighting the need for improved safety measures and evaluation techniques.

Findings

01. Up to 52% transformation success rate in zero-shot conditions

02. Up to 41% jailbreak success rate in few-shot conditions

03. Automated evaluations overestimate jailbreak success by 52% on average
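
The findings above are reported as success rates and as a gap between automated and other evaluation. This page does not show how the paper computes them, or whether the 52% overestimate is in percentage points or relative terms. The following is a minimal sketch, assuming each trial carries one boolean label per evaluator and that the gap is a simple difference in percentage points. The function names, the choice of a human evaluator as the comparison, and the labels are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Return the share of trials labeled successful, as a percentage."""
    if not outcomes:
        raise ValueError("at least one trial is required")
    return 100.0 * sum(outcomes) / len(outcomes)


def overestimation_gap(automated: Sequence[bool], human: Sequence[bool]) -> float:
    """Automated rate minus human rate, in percentage points (an assumed definition)."""
    if len(automated) != len(human):
        raise ValueError("both evaluators must label the same trials")
    return success_rate(automated) - success_rate(human)


# Made-up labels for five hypothetical trials; not data from the paper.
automated_labels = [True, True, True, False, True]
human_labels = [True, False, False, False, True]

gap = overestimation_gap(automated_labels, human_labels)
print(f"automated: {success_rate(automated_labels):.1f}%")
print(f"human:     {success_rate(human_labels):.1f}%")
print(f"gap:       {gap:.1f} percentage points")
```

With these made-up labels, the automated evaluator marks four of five trials as successful and the human evaluator marks two, so the gap is the difference between those two rates.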

Abstract

Large Language Models (LLMs) are generally equipped with guardrails to block the generation of harmful responses. However, existing defenses always assume that an external attacker crafts the harmful query, and the possibility of a model's own output becoming a new attack vector has not been sufficiently explored. In this study, we propose the Self-HarmLLM scenario, which uses a Mitigated Harmful Query (MHQ) generated by the same model as a new input. An MHQ is an ambiguous query whose original intent is preserved while its harmful nature is not directly exposed. We verified whether a jailbreak occurs when this MHQ is re-entered into a separate session of the same model. We conducted experiments on GPT-3.5-turbo, LLaMA3-8B-instruct, and DeepSeek-R1-Distill-Qwen-7B under Base, Zero-shot, and Few-shot conditions. The results showed up to 52% transformation success rate and up to 33%…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)