Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis
Haoyu Zhang, Mohammad Zandsalimy, Shanu Sushmita

TL;DR
This paper shows that harmful prompts encoded as mathematical problems can bypass LLM safety filters, exposing fundamental vulnerabilities in current safety mechanisms.
Contribution
It introduces a novel Formal Logic encoding and uses it to systematically analyze and demonstrate safety gaps in large language models.
Findings
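Encoding harmful prompts as mathematical problems achieves 46-56% average attack success across eight target models and two benchmarks.
Deep reformulation by a helper LLM is crucial: rule-based encodings that only apply mathematical formatting perform no better than unencoded baselines.
Newer models such as GPT-5 are more robust but still vulnerable.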
Abstract
Large language models (LLMs) employ safety mechanisms to prevent harmful outputs, yet these defenses primarily rely on semantic pattern matching. We show that encoding harmful prompts as coherent mathematical problems -- using formalisms such as set theory, formal logic, and quantum mechanics -- bypasses these filters at high rates, achieving 46%--56% average attack success across eight target models and two established benchmarks. Crucially, the effectiveness depends not on mathematical notation itself, but on whether a helper LLM deeply reformulates the harmful content into a genuine mathematical problem: rule-based encodings that apply mathematical formatting without such reformulation perform no better than unencoded baselines. We introduce a novel Formal Logic encoding that achieves attack success comparable to Set Theory, demonstrating that this vulnerability generalizes across…
