BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage
Kalyan Nakka, Nitesh Saxena

TL;DR
This paper introduces BitBypass, a novel black-box attack that uses bitstream camouflage to jailbreak aligned large language models, bypassing their safety measures and eliciting harmful content.
Contribution
It presents a new attack method based on bitstream camouflage, revealing vulnerabilities in current safety alignment techniques of LLMs.
Findings
BitBypass successfully bypasses safety alignment in five state-of-the-art LLMs.
It outperforms existing jailbreak methods in stealthiness and success rate.
The attack exploits the fundamental representation of data as bits rather than prompt engineering.
Abstract
The inherent risk of Large Language Models (LLMs) generating harmful and unsafe content has highlighted the need for their safety alignment. Various techniques, such as supervised fine-tuning, reinforcement learning from human feedback, and red-teaming, were developed to ensure the safety alignment of LLMs. However, the robustness of these aligned LLMs is always challenged by adversarial attacks that exploit unexplored and underlying vulnerabilities of the safety alignment. In this paper, we develop a novel black-box jailbreak attack, called BitBypass, that leverages hyphen-separated bitstream camouflage for jailbreaking aligned LLMs. This represents a new direction in jailbreaking by exploiting fundamental information representation of data as continuous bits, rather than leveraging prompt engineering or adversarial manipulations. Our evaluation of five state-of-the-art LLMs, namely…
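To make the phrase "hyphen-separated bitstream" concrete, the sketch below round-trips a harmless string through such a representation. It is only an illustration of the data format. The abstract does not specify the exact layout, so the 8-bit-per-byte grouping, the UTF-8 encoding, and the function names `to_bitstream` and `from_bitstream` are assumptions rather than the authors' implementation.

```python
def to_bitstream(text: str, sep: str = "-") -> str:
    """Encode text as 8-bit binary groups joined by `sep`.

    Assumed layout: one group per UTF-8 byte, hyphen between groups.
    """
    return sep.join(format(byte, "08b") for byte in text.encode("utf-8"))


def from_bitstream(bits: str, sep: str = "-") -> str:
    """Decode a `sep`-separated string of 8-bit groups back to text."""
    return bytes(int(group, 2) for group in bits.split(sep)).decode("utf-8")


encoded = to_bitstream("hi")
print(encoded)                  # 01101000-01101001
print(from_bitstream(encoded))  # hi
```

The decoder is the half most relevant to defenses: a filter that normalizes such groups back to text before safety checks would see the plain string rather than the bit pattern.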
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Digital Media Forensic Detection · Generative Adversarial Networks and Image Synthesis
Methods: LLaMA
