Adversarial Contrastive Learning for LLM Quantization Attacks
Dinghong Song, Zhiwei Xu, Hai Wan, Xibin Zhao, Pengfei Su, Dong Li

TL;DR
This paper introduces Adversarial Contrastive Learning (ACL), a gradient-based quantization attack in which an LLM that appears benign in full precision behaves maliciously once quantized. ACL reaches higher attack success rates than prior quantization attacks, which raises new concerns for the safe deployment of quantized models.
Contribution
The paper presents a contrastive learning-based attack framework for LLM quantization. It formulates the attack objective as a triplet-based contrastive loss and optimizes it with projected gradient descent in a two-stage distributed fine-tuning strategy.
Findings
ACL achieves attack success rates up to 97.69% (jailbreak).
ACL outperforms state-of-the-art methods by up to 50.80%.
The method is effective across three attack types: over-refusal (86.00%), jailbreak (97.69%), and advertisement injection (92.40%).
Abstract
Model quantization is critical for deploying large language models (LLMs) on resource-constrained hardware, yet recent work has revealed a severe security risk: LLMs that are benign in full precision may exhibit malicious behaviors after quantization. In this paper, we propose Adversarial Contrastive Learning (ACL), a novel gradient-based quantization attack that achieves superior attack effectiveness by explicitly maximizing the gap between the probabilities of benign and harmful responses. ACL formulates the attack objective as a triplet-based contrastive loss and integrates it with a projected-gradient-descent, two-stage distributed fine-tuning strategy to ensure stable and efficient optimization. Extensive experiments demonstrate ACL's remarkable effectiveness, achieving attack success rates of 86.00% for over-refusal, 97.69% for jailbreak, and 92.40% for advertisement injection, substantially…
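The two ingredients named in the abstract, a triplet-based contrastive loss over response probabilities and a projection step from projected gradient descent, can be illustrated with a small generic sketch. This is not the authors' implementation: the abstract does not give the exact loss, so the hinge form, the `margin` parameter, the box-shaped constraint set, and all function names below are assumptions chosen for illustration. The prompt plays the role of the anchor, and which response counts as "positive" depends on the objective being optimized.

```python
def sequence_log_prob(token_log_probs):
    """Sum per-token log-probabilities into one sequence log-probability."""
    return sum(token_log_probs)


def triplet_margin_loss(logp_positive, logp_negative, margin=1.0):
    """Hinge-style triplet loss (assumed form, not taken from the paper).

    The loss is zero once the positive response's log-probability exceeds
    the negative response's log-probability by at least `margin`.
    """
    return max(0.0, margin - (logp_positive - logp_negative))


def project_to_box(weights, lower, upper):
    """Generic PGD projection: clamp each weight into [lower_i, upper_i].

    The box constraint is an assumption; the abstract only states that
    projected gradient descent is used.
    """
    return [min(max(w, lo), hi) for w, lo, hi in zip(weights, lower, upper)]


# Toy numbers, purely illustrative.
logp_pos = sequence_log_prob([-0.2, -0.3])   # preferred response
logp_neg = sequence_log_prob([-1.0, -1.5])   # dispreferred response
loss_satisfied = triplet_margin_loss(logp_pos, logp_neg)  # gap already exceeds margin
loss_violated = triplet_margin_loss(-2.0, -1.5)           # gap has the wrong sign
projected = project_to_box([0.5, -2.0, 3.0], [0.0, -1.0, 0.0], [1.0, 1.0, 2.0])
```

In this sketch the loss only penalizes triplets whose probability gap is smaller than the margin, which is one common way to "maximize the gap" without pushing already well-separated pairs further apart.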
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Neural Network Applications · Wireless Signal Modulation Classification
