Safety-Preserving PTQ via Contrastive Alignment Loss
Sunghyun Wee, Suyoung Kim, Hyeonjin Kim, Kyomin Hwang, Nojun Kwak

TL;DR
This paper introduces Contrastive Alignment Quantization (CAQ), a post-training quantization (PTQ) method that adds a contrastive alignment loss to the quantization objective so that quantized large language models retain their safety alignment, without additional safety datasets or significant computational overhead.
Contribution
The paper proposes CAQ, a PTQ approach that incorporates a contrastive alignment loss to improve safety and behavioral fidelity in quantized models. It addresses a key limitation of existing PTQ methods, which minimize reconstruction error without accounting for behavioral alignment.
Findings
CAQ achieves superior safety alignment in 4-bit quantization.
CAQ maintains model capabilities while improving safety.
CAQ requires no additional safety datasets or significant computational overhead.
Abstract
Post-Training Quantization (PTQ) has become the de facto standard for efficient LLM deployment, yet its optimization objective remains fundamentally incomplete. Standard PTQ methods minimize reconstruction error (e.g., MSE or KL divergence) without accounting for behavioral alignment, a critical property instilled through safety fine-tuning. We demonstrate that this objective mismatch introduces a systematic vulnerability: models can maintain low perplexity yet exhibit significant degradation in safety alignment, revealing that perplexity alone is an insufficient and often misleading proxy for deployment readiness. To address this, we propose Contrastive Alignment Quantization (CAQ), which extends the PTQ objective design space by integrating a Contrastive Alignment Loss (CAL). CAL introduces a principled push-pull mechanism that jointly optimizes distributional fidelity and behavioral…
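The abstract is cut off before it defines CAL, so the following is only a rough sketch of what a push-pull objective of this general kind might look like, not the authors' formulation. It assumes (hypothetically) that the "pull" term is a KL divergence toward the full-precision aligned model's output distribution and that the "push" term is a hinge penalty that activates when the quantized model comes within a margin of some unaligned reference model. The function names, the `weight` and `margin` parameters, and the choice of reference model are all illustrative assumptions.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def kl_divergence(ref_logits, model_logits):
    """Mean KL(ref || model) over all rows of next-token logits."""
    ref_logp = log_softmax(ref_logits)
    model_logp = log_softmax(model_logits)
    per_row = (np.exp(ref_logp) * (ref_logp - model_logp)).sum(axis=-1)
    return float(per_row.mean())


def push_pull_loss(quant_logits, aligned_logits, unaligned_logits,
                   weight=0.5, margin=1.0):
    """Hypothetical push-pull objective (an assumption, not the paper's CAL).

    pull: keep the quantized model close to the aligned full-precision model.
    push: penalize the quantized model when its divergence from the
          unaligned reference falls below `margin`.
    """
    pull = kl_divergence(aligned_logits, quant_logits)
    push = max(0.0, margin - kl_divergence(unaligned_logits, quant_logits))
    return pull + weight * push


# Toy logits over a 3-token vocabulary (made-up values for illustration).
aligned = np.array([[5.0, 0.0, 0.0]])
unaligned = np.array([[0.0, 5.0, 0.0]])

loss_faithful = push_pull_loss(aligned, aligned, unaligned)    # quantized == aligned
loss_drifted = push_pull_loss(unaligned, aligned, unaligned)   # quantized == unaligned
```

In this toy setup, a quantized model that reproduces the aligned distribution incurs no penalty, while one that has drifted onto the unaligned distribution pays both the pull term and the full margin. A plain reconstruction loss would correspond to the pull term alone.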
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Natural Language Processing Techniques
