Ignore All Previous Instructions: Jailbreaking as a de-escalatory peace building practise to resist LLM social media bots
Huw Day, Adrianna Jezierska, Jessica Woodgate

TL;DR
This paper proposes using jailbreaking as a user-driven, non-violent method to de-escalate conflicts on social media by exposing and disrupting LLM-powered misinformation campaigns.
Contribution
It introduces a novel user-centric approach to counteracting LLM social media bots through jailbreaking as a peace-building practice.
Findings
Jailbreaking can effectively expose LLM automation behaviors.
User engagement disrupts the circulation of misleading narratives.
The approach offers an alternative to platform-led moderation.
Abstract
Large Language Models have intensified the scale and strategic manipulation of political discourse on social media, leading to conflict escalation. The existing literature largely focuses on platform-led moderation as a countermeasure. In this paper, we propose a user-centric view of "jailbreaking" as an emergent, non-violent de-escalation practice. Online users engage with suspected LLM-powered accounts to circumvent large language model safeguards, exposing automated behaviour and disrupting the circulation of misleading narratives.
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Misinformation and Its Impacts · Terrorism, Counterterrorism, and Political Violence
