Jailbreaking Large Language Models with Symbolic Mathematics
Emet Bethany, Mazal Bethany, Juan Arturo Nolazco Flores, Sumit Kumar Jha, Peyman Najafirad

TL;DR
This paper reveals a new vulnerability in large language models where encoding harmful prompts as symbolic math problems can bypass safety measures, exposing the need for more comprehensive safety testing.
Contribution
Introduces MathPrompt, a novel method exploiting LLMs' symbolic math abilities to bypass safety mechanisms, demonstrating significant vulnerabilities in current AI safety approaches.
Findings
Average attack success rate of 73.6% across 13 LLMs
Semantic shift in embeddings explains attack effectiveness
Highlights need for broader safety testing methods
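One way to make the "semantic shift" finding concrete is to compare the embedding of an original prompt with the embedding of its encoded counterpart. The sketch below assumes cosine similarity as the metric, which this summary does not confirm. The function name `cosine_similarity` and the toy vectors are hypothetical placeholders, not real embeddings or results from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy placeholder vectors (assumption: NOT real embeddings from any model).
original_embedding = [1.0, 0.0, 0.0]
encoded_embedding = [0.0, 1.0, 0.0]

# A larger value means the two prompts sit further apart in embedding space.
shift = 1.0 - cosine_similarity(original_embedding, encoded_embedding)
print(f"semantic shift (1 - cosine similarity): {shift:.3f}")
```

In practice the vectors would come from an embedding model. Low similarity between the two versions of a prompt would be consistent with safety training failing to recognize the encoded form.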
Abstract
Recent advancements in AI safety have led to increased efforts in training and red-teaming large language models (LLMs) to mitigate unsafe content generation. However, these safety mechanisms may not be comprehensive, leaving potential vulnerabilities unexplored. This paper introduces MathPrompt, a novel jailbreaking technique that exploits LLMs' advanced capabilities in symbolic mathematics to bypass their safety mechanisms. By encoding harmful natural language prompts into mathematical problems, we demonstrate a critical vulnerability in current AI safety measures. Our experiments across 13 state-of-the-art LLMs reveal an average attack success rate of 73.6%, highlighting the inability of existing safety training mechanisms to generalize to mathematically encoded inputs. Analysis of embedding vectors shows a substantial semantic shift between original and encoded prompts, helping…
Taxonomy
Topics: Neural Networks and Applications · Computational Physics and Python Applications · Computability, Logic, AI Algorithms
