On Jailbreaking Quantized Language Models Through Fault Injection Attacks

Noureldin Zahran; Ahmad Tahmasivand; Ihsen Alouani; Khaled Khasawneh; Mohammed E. Fouda

arXiv:2507.03236 · cs.CR · July 10, 2025


Open Access

TL;DR

This paper investigates how fault injection attacks can jailbreak quantized language models. Lower-precision weight formats such as FP8 and INT8 reduce attack success compared with FP16, but vulnerabilities remain, and jailbreaks induced before quantization can carry over to the quantized model.

Contribution

The paper introduces gradient-guided fault injection attacks tailored to quantized LMs and evaluates them across several quantization schemes, highlighting how vulnerability differs from one scheme to another.

Findings

01. High attack success (>80%) on FP16 models

02. FP8 and INT8 models show reduced attack success (<50%)

03. Transferability of jailbreaks is high across FP16, FP8, and INT8 models

Abstract

The safety alignment of Language Models (LMs) is a critical concern, yet their integrity can be challenged by direct parameter manipulation attacks, such as those potentially induced by fault injection. As LMs are increasingly deployed using low-precision quantization for efficiency, this paper investigates the efficacy of such attacks for jailbreaking aligned LMs across different quantization schemes. We propose gradient-guided attacks, including a tailored progressive bit-level search algorithm introduced herein and a comparative word-level (single weight update) attack. Our evaluation on Llama-3.2-3B, Phi-4-mini, and Llama-3-8B across FP16 (baseline) and weight-only quantization (FP8, INT8, INT4) reveals that quantization significantly influences attack success. While attacks readily achieve high success (>80% Attack Success Rate, ASR) on FP16 models, within an attack budget of 25…
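As background for the bit-level perturbations the abstract mentions, the sketch below shows how the numeric effect of a single flipped bit depends on the storage format. It is an illustration of number formats only, not the paper's search algorithm, and the abstract does not state that this is the reason for the differing success rates. The helper names `flip_fp16_bit` and `flip_int8_bit`, the example weights, and the symmetric per-tensor `scale` used for INT8 dequantization are all assumptions made for the example.

```python
import struct


def flip_fp16_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = LSB, 15 = sign) of the IEEE 754 half-precision encoding."""
    (raw,) = struct.unpack("<H", struct.pack("<e", value))
    (flipped,) = struct.unpack("<e", struct.pack("<H", raw ^ (1 << bit)))
    return flipped


def flip_int8_bit(q: int, bit: int, scale: float) -> float:
    """Flip one bit of a two's-complement INT8 weight, then dequantize.

    Assumes symmetric per-tensor quantization: real_value = q * scale.
    """
    raw = (q & 0xFF) ^ (1 << bit)
    signed = raw - 256 if raw >= 128 else raw
    return signed * scale


if __name__ == "__main__":
    w = 0.5
    # Flipping the top exponent bit of an FP16 weight changes it by orders of magnitude.
    print("FP16 0.5, flip bit 14:", flip_fp16_bit(w, 14))
    # Flipping the lowest mantissa bit barely changes it.
    print("FP16 0.5, flip bit 0: ", flip_fp16_bit(w, 0))

    scale = 0.01  # hypothetical dequantization scale
    q = 50        # represents 0.5 under that scale
    # Any INT8 flip keeps the dequantized value within [-128 * scale, 127 * scale].
    for bit in range(8):
        print(f"INT8 q=50, flip bit {bit}:", flip_int8_bit(q, bit, scale))
```

Under these assumptions, a flip in an FP16 exponent bit can move a weight far outside its original range, while an INT8 flip is bounded by the quantization scale.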

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Advanced Malware Detection Techniques