TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts

Torsten Krauß; Hamid Dashtbani; Alexandra Dmitrienko

arXiv:2506.07596 · cs.LG · June 10, 2025

TL;DR

TwinBreak is a novel method that identifies and prunes safety-related parameters in large language models, effectively bypassing safety mechanisms with high success rates and minimal computational costs.

Contribution

TwinBreak introduces a fine-grained analysis technique that examines intermediate outputs to isolate safety-related parameters, reportedly a first in the field.

Findings

01. Achieves success rates of 89% to 98% in bypassing safety mechanisms.

02. Works across 16 LLMs from five vendors.

03. Requires minimal computational resources.

Abstract

Machine learning is advancing rapidly, with applications bringing notable benefits, such as improvements in translation and code generation. Models like ChatGPT, powered by Large Language Models (LLMs), are increasingly integrated into daily life. However, alongside these benefits, LLMs also introduce social risks. Malicious users can exploit LLMs by submitting harmful prompts, such as requesting instructions for illegal activities. To mitigate this, models often include a security mechanism that automatically rejects such harmful prompts. However, this mechanism can be bypassed through LLM jailbreaks. Current jailbreaks often require significant manual effort, incur high computational costs, or result in excessive model modifications that may degrade regular utility. We introduce TwinBreak, an innovative safety alignment removal method. Building on the idea that the safety mechanism operates like an…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education · Advanced Malware Detection Techniques