Jailbreaking Large Language Models with Symbolic Mathematics

Emet Bethany; Mazal Bethany; Juan Arturo Nolazco Flores; Sumit Kumar Jha; Peyman Najafirad

arXiv:2409.11445·cs.CR·November 6, 2024

TL;DR

This paper reveals a new vulnerability in large language models where encoding harmful prompts as symbolic math problems can bypass safety measures, exposing the need for more comprehensive safety testing.

Contribution

Introduces MathPrompt, a novel method exploiting LLMs' symbolic math abilities to bypass safety mechanisms, demonstrating significant vulnerabilities in current AI safety approaches.

Findings

01 · Average attack success rate of 73.6% across 13 LLMs

02 · Semantic shift in embeddings helps explain the attack's effectiveness

03 · Highlights the need for broader safety testing methods
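The 73.6% figure is an average over 13 models, but the page does not say how it is aggregated. The following is only a minimal sketch assuming a macro-average: a per-model attack success rate (successful attacks divided by attempted prompts), averaged with equal weight across models. The function names and the toy counts are hypothetical illustrations, not data from the paper.

```python
from typing import Dict, Tuple


def attack_success_rate(successes: int, attempts: int) -> float:
    """Fraction of attempted prompts that counted as successful attacks."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= successes <= attempts:
        raise ValueError("successes must lie in [0, attempts]")
    return successes / attempts


def macro_average_asr(per_model: Dict[str, Tuple[int, int]]) -> float:
    """Average per-model ASRs with equal weight per model.

    Assumption: each model counts equally regardless of how many prompts
    it was tested on (a macro-average, not a pooled rate).
    """
    if not per_model:
        raise ValueError("need at least one model")
    rates = [attack_success_rate(s, n) for s, n in per_model.values()]
    return sum(rates) / len(rates)


# Toy (successes, attempts) counts for two hypothetical models.
toy = {"model_a": (3, 4), "model_b": (1, 2)}
print(macro_average_asr(toy))  # 0.625
```

A pooled rate (total successes over total attempts) would give 4/6, about 0.667, on the same toy counts, so the aggregation choice matters whenever models are tested on different numbers of prompts.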

Abstract

Recent advancements in AI safety have led to increased efforts in training and red-teaming large language models (LLMs) to mitigate unsafe content generation. However, these safety mechanisms may not be comprehensive, leaving potential vulnerabilities unexplored. This paper introduces MathPrompt, a novel jailbreaking technique that exploits LLMs' advanced capabilities in symbolic mathematics to bypass their safety mechanisms. By encoding harmful natural language prompts into mathematical problems, we demonstrate a critical vulnerability in current AI safety measures. Our experiments across 13 state-of-the-art LLMs reveal an average attack success rate of 73.6%, highlighting the inability of existing safety training mechanisms to generalize to mathematically encoded inputs. Analysis of embedding vectors shows a substantial semantic shift between original and encoded prompts, helping…
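The abstract attributes part of the attack's effect to a semantic shift between the embeddings of original and encoded prompts. This page does not name the embedding model or the distance measure, so the sketch below assumes cosine similarity, with the shift taken as one minus that similarity. The vectors are plain lists standing in for embeddings from any model, and the helper names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, nonzero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("vectors must be nonzero")
    return dot / (norm_u * norm_v)


def semantic_shift(original: Sequence[float], encoded: Sequence[float]) -> float:
    """Hypothetical shift score: 0 for the same direction, up to 2 for opposite."""
    return 1.0 - cosine_similarity(original, encoded)


# Toy 3-dimensional vectors standing in for real embeddings.
original_emb = [1.0, 0.0, 0.0]
encoded_emb = [0.0, 1.0, 0.0]
print(semantic_shift(original_emb, encoded_emb))  # 1.0
```

A larger score means the encoded prompt points in a more different direction in embedding space; whether the paper reports this exact quantity is not stated on this page.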

Taxonomy

Topics: Neural Networks and Applications · Computational Physics and Python Applications · Computability, Logic, AI Algorithms