'For Argument's Sake, Show Me How to Harm Myself!': Jailbreaking LLMs in Suicide and Self-Harm Contexts

Annika M Schoene; Cansu Canca

arXiv:2507.02990·cs.CL·July 8, 2025

TL;DR

This paper demonstrates that multi-step, prompt-level jailbreaking can bypass the safety filters of large language models and elicit harmful content in suicide and self-harm contexts, underscoring the need for more comprehensive safety measures.

Contribution

The study introduces two new adversarial test cases in mental health, one for suicide and one for self-harm, and empirically evaluates them across six widely available LLMs.

Findings

01

Bypass techniques successfully elicit harmful content from multiple models

02

User intent is often disregarded in safety filter failures

03

The results highlight the need for continuous adversarial testing and improved safety protocols

Abstract

Recent advances in large language models (LLMs) have led to increasingly sophisticated safety protocols and features designed to prevent harmful, unethical, or unauthorized outputs. However, these guardrails remain susceptible to novel and creative forms of adversarial prompting, including manually generated test cases. In this work, we present two new test cases in mental health for (i) suicide and (ii) self-harm, using multi-step, prompt-level jailbreaking that bypasses built-in content and safety filters. We show that user intent is disregarded, leading to the generation of detailed harmful content and instructions that could cause real-world harm. We conduct an empirical evaluation across six widely available LLMs, demonstrating the generalizability and reliability of the bypass. We assess these findings and the multilayered ethical tensions that they present for their implications on…
