Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation
Huizhen Shu, Xuying Li, Qirui Wang, Yuji Kosuga, Mengqiu Tian, Zhuo Li

TL;DR
This paper introduces a novel black-box adversarial attack method for NLP models using sparse autoencoders to identify and perturb critical features, effectively bypassing current defenses and exposing vulnerabilities.
Contribution
The paper presents the Sparse Feature Perturbation Framework (SFPF), a new approach leveraging sparse autoencoders for targeted adversarial text generation in NLP.
Findings
SFPF can generate adversarial texts that bypass state-of-the-art defenses.
The method effectively identifies and manipulates critical features in text.
Effectiveness varies across prompts and model layers.
Abstract
With the rapid proliferation of Natural Language Processing (NLP), especially Large Language Models (LLMs), generating adversarial examples to jailbreak LLMs remains a key challenge for understanding model vulnerabilities and improving robustness. In this context, we propose a new black-box attack method that leverages the interpretability of large models. We introduce the Sparse Feature Perturbation Framework (SFPF), a novel approach for adversarial text generation that utilizes sparse autoencoders (SAEs) to identify and manipulate critical features in text. After using an SAE to reconstruct hidden-layer representations, we perform feature clustering on the successfully attacked texts to identify features with higher activations. These highly activated features are then perturbed to generate new adversarial texts. This selective perturbation preserves the malicious intent while…
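The pipeline the abstract outlines (encode hidden states with a sparse autoencoder, find the features that are most active on successful attacks, then perturb those features) can be illustrated with a rough NumPy sketch. Everything below is an assumption made for illustration and not the authors' implementation: the function names are hypothetical, the autoencoder is a plain ReLU encoder with a linear decoder and random weights, a top-k selection by mean activation stands in for the paper's feature clustering, and multiplicative scaling stands in for whatever perturbation the paper applies. How perturbed hidden states are turned back into text is not covered.

```python
import numpy as np

def sae_encode(h, W_enc, b_enc):
    """ReLU encoder: hidden states (n, d) -> sparse feature activations (n, m)."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

def sae_decode(f, W_dec, b_dec):
    """Linear decoder: feature activations (n, m) -> reconstructed hidden states (n, d)."""
    return f @ W_dec + b_dec

def top_active_features(h_success, W_enc, b_enc, k):
    """Indices of the k features with the highest mean activation.

    Assumption: this top-k rule is a simple stand-in for the feature
    clustering step described in the abstract.
    """
    mean_act = sae_encode(h_success, W_enc, b_enc).mean(axis=0)
    return np.argsort(mean_act)[::-1][:k]

def perturb_hidden(h, idx, W_enc, b_enc, W_dec, b_dec, scale=2.0):
    """Scale the selected features and add the decoded change back onto h.

    Only the difference between the perturbed and unperturbed
    reconstructions is added, so the SAE's reconstruction error does
    not leak into the result. `scale` is a hypothetical knob.
    """
    f = sae_encode(h, W_enc, b_enc)
    f_new = f.copy()
    f_new[:, idx] *= scale
    return h + sae_decode(f_new, W_dec, b_dec) - sae_decode(f, W_dec, b_dec)

# Toy usage with random weights (purely illustrative shapes and values).
rng = np.random.default_rng(0)
d, m = 8, 32                                # hidden size, number of SAE features
W_enc, b_enc = rng.normal(size=(d, m)), np.zeros(m)
W_dec, b_dec = rng.normal(size=(m, d)), np.zeros(d)
h_success = rng.normal(size=(16, d))        # hidden states of "successful" prompts
idx = top_active_features(h_success, W_enc, b_enc, k=4)
h_new = perturb_hidden(rng.normal(size=(3, d)), idx, W_enc, b_enc, W_dec, b_dec)
```

With `scale=1.0` the function returns the input hidden states unchanged, which is a quick sanity check that only the selected features drive the perturbation.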
