Ignore All Previous Instructions: Jailbreaking as a de-escalatory peace building practise to resist LLM social media bots

Huw Day; Adrianna Jezierska; Jessica Woodgate

arXiv:2603.01942·cs.HC·March 3, 2026

Open Access

TL;DR

This paper proposes using jailbreaking as a user-driven, non-violent method to de-escalate conflicts on social media by exposing and disrupting LLM-powered misinformation campaigns.

Contribution

It introduces a novel user-centric approach to counteracting LLM social media bots through jailbreaking as a peace-building practice.

Findings

01. Jailbreaking can effectively expose LLM automation behaviors.
02. User engagement disrupts the circulation of misleading narratives.
03. The approach offers an alternative to platform-led moderation.

Abstract

Large Language Models have intensified the scale and strategic manipulation of political discourse on social media, leading to conflict escalation. The existing literature largely focuses on platform-led moderation as a countermeasure. In this paper, we propose a user-centric view of "jailbreaking" as an emergent, non-violent de-escalation practice. Online users engage with suspected LLM-powered accounts to circumvent large language model safeguards, exposing automated behaviour and disrupting the circulation of misleading narratives.

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Misinformation and Its Impacts · Terrorism, Counterterrorism, and Political Violence