Activation Approximations Can Incur Safety Vulnerabilities Even in Aligned LLMs: Comprehensive Analysis and Defense

Jiawen Zhang; Kejia Chen; Lipeng He; Jian Lou; Dan Li; Zunlei Feng; Mingli Song; Jian Liu; Kui Ren; Xiaohu Yang

arXiv:2502.00840 · cs.CR · June 11, 2025

Open Access

TL;DR

This paper systematically evaluates the safety risks of activation approximation techniques in large language models and proposes QuadA, a novel defense method to mitigate these safety vulnerabilities.

Contribution

The paper provides the first comprehensive safety analysis of activation approximations in LLMs and introduces QuadA, a method for restoring safety after approximation is applied.

Findings

01. Activation approximations degrade LLM safety, and the effect appears across multiple approximation techniques.

02. QuadA effectively mitigates the safety issues caused by activation approximations.

03. The safety vulnerabilities are consistent across various state-of-the-art LLMs.

Abstract

Large Language Models (LLMs) have showcased remarkable capabilities across various domains. As their capabilities evolve and their deployment scenarios expand, deployment challenges escalate due to the models' sheer scale and the advanced yet complex activation designs prevalent in notable model series such as Llama, Gemma, and Mistral. These challenges are particularly pronounced in resource-constrained deployment scenarios, where mitigating inference bottlenecks is imperative. Among various recent efforts, activation approximation has emerged as a promising avenue for pursuing inference efficiency, and is sometimes considered indispensable in applications such as private inference. Despite achieving substantial speedups with minimal impact on utility, even appearing sound and practical for real-world deployment, the safety implications of activation approximations remain…
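
To make the term "activation approximation" concrete, the following is a minimal sketch of the general idea: swapping an exact activation for a cheaper surrogate and measuring how far the two diverge. It is an illustration only and is not taken from the paper. The choice of SiLU as the exact activation, hard-swish as the surrogate, the interval [-6, 6], and the helper name `max_abs_error` are all assumptions made for this example, and hard-swish is not claimed to be one of the techniques the paper evaluates.

```python
import math


def silu(x: float) -> float:
    """Exact SiLU (also called swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def hard_swish(x: float) -> float:
    """Piecewise surrogate with no exponential: x * clamp(x + 3, 0, 6) / 6."""
    return x * min(max(x + 3.0, 0.0), 6.0) / 6.0


def max_abs_error(lo: float = -6.0, hi: float = 6.0, steps: int = 1201) -> float:
    """Largest |silu(x) - hard_swish(x)| over an evenly spaced grid (hypothetical helper)."""
    worst = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        worst = max(worst, abs(silu(x) - hard_swish(x)))
    return worst


if __name__ == "__main__":
    # The surrogate avoids exp() but does not match the exact activation everywhere.
    print(f"max |silu - hard_swish| on [-6, 6]: {max_abs_error():.4f}")
```

The surrogate avoids the exponential but leaves a non-zero gap to the exact function. Per-activation deviations of this kind are what an approximated model carries through every layer, and the paper examines their effect on safety.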

Taxonomy

Topics: Risk and Safety Analysis