Alphabet Index Mapping: Jailbreaking LLMs through Semantic Dissimilarity
Bilal Saleh Husain

TL;DR
This paper introduces Alphabet Index Mapping (AIM), an adversarial attack that combines high semantic dissimilarity with simplicity to jailbreak large language models such as GPT-4, reaching a 94% attack success rate and outperforming FlipAttack and other methods on an AdvBench subset.
Contribution
The paper proposes AIM, a new attack method that balances semantic dissimilarity and simplicity, and uses it to examine how prompt manipulation leads to model jailbreaks.
Findings
AIM achieves a 94% attack success rate on GPT-4.
Semantic dissimilarity correlates inversely with attack success.
AIM outperforms FlipAttack and other methods on an AdvBench subset.
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their susceptibility to adversarial attacks, particularly jailbreaking, poses significant safety and ethical concerns. While numerous jailbreak methods exist, many suffer from computational expense, high token usage, or complex decoding schemes. Liu et al. (2024) introduced FlipAttack, a black-box method that achieves high attack success rates (ASR) through simple prompt manipulation. This paper investigates the underlying mechanisms of FlipAttack's effectiveness by analyzing the semantic changes induced by its flipping modes. We hypothesize that semantic dissimilarity between original and manipulated prompts is inversely correlated with ASR. To test this, we examine embedding space visualizations (UMAP, KDE) and cosine similarities for FlipAttack's modes. Furthermore, we introduce a novel adversarial attack,…
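The abstract compares original and manipulated prompts by the cosine similarity of their embeddings. The following is a minimal sketch of that measurement, not the paper's code: the helper names, the choice of `1 - cosine similarity` as a dissimilarity score, and the toy three-dimensional vectors are all assumptions made for illustration. In practice the vectors would come from an embedding model, which the text above does not name.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def semantic_dissimilarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumed dissimilarity score: 1 minus cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


if __name__ == "__main__":
    # Hypothetical embeddings; real ones would have hundreds of dimensions.
    original_prompt = [0.9, 0.1, 0.2]
    manipulated_prompt = [0.1, 0.8, 0.3]
    score = semantic_dissimilarity(original_prompt, manipulated_prompt)
    print(f"dissimilarity: {score:.3f}")
```

A score near 0 means the manipulated prompt sits close to the original in embedding space, while larger values mean the manipulation moved it further away.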
Taxonomy
Topics: Artificial Intelligence in Law · Law, AI, and Intellectual Property · Digital and Cyber Forensics
Methods: Layer Normalization · Dropout · Absolute Position Encodings · Dense Connections · Byte Pair Encoding · Softmax · Label Smoothing · Transformer · GPT-4
