Guided Perturbation Sensitivity (GPS): Detecting Adversarial Text via Embedding Stability and Word Importance

Bryan E. Tuck; Rakesh M. Verma

arXiv:2508.11667·cs.LG·January 30, 2026

TL;DR

The paper introduces Guided Perturbation Sensitivity (GPS), an attack-agnostic framework that detects adversarial text by measuring how embedding representations change when important words are masked, achieving over 85% detection accuracy across three datasets, three attack types, and two victim models.

Contribution

GPS is a new detection method that measures embedding sensitivity to masking important words, performing competitively with existing methods and generalizing to unseen datasets, attacks, and models without retraining.

Findings

01

Achieves over 85% detection accuracy across three datasets, three attack types, and two victim models.

02

Gradient-based word importance ranking outperforms other heuristics.

03

GPS generalizes to unseen datasets, attacks, and models without retraining.
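
To make the idea of gradient-based word importance concrete, here is a minimal sketch of one common gradient attribution (gradient times input) on a toy linear classifier. Everything in it is an assumption for illustration: the toy model, the function names `gradient_x_input_importance` and `rank_tokens`, and the hand-picked embeddings are hypothetical and are not taken from the paper, whose exact heuristic and victim models are not described on this page.

```python
import numpy as np

def gradient_x_input_importance(token_embeddings, w):
    """Score each token by |gradient . embedding| for a toy linear classifier.

    Toy model (an assumption, not the paper's victim model):
        logit = w . mean(token_embeddings)
    so d(logit)/d(e_i) = w / n for every token i.
    """
    n = token_embeddings.shape[0]
    grad = np.tile(w / n, (n, 1))  # identical gradient per token in this linear toy
    return np.abs(np.sum(grad * token_embeddings, axis=1))

def rank_tokens(tokens, token_embeddings, w):
    """Return (token, score) pairs sorted from most to least important."""
    scores = gradient_x_input_importance(token_embeddings, w)
    order = np.argsort(-scores, kind="stable")
    return [(tokens[i], float(scores[i])) for i in order]

# Hand-picked 2-d embeddings and classifier weights, purely illustrative.
tokens = ["the", "movie", "was", "terrible"]
emb = np.array([[0.1, 0.0], [0.2, 0.6], [0.0, 0.2], [-0.9, 1.2]])
w = np.array([-1.0, 1.0])
ranking = rank_tokens(tokens, emb, w)
```

In this toy setup the token whose embedding aligns most strongly with the classifier weights ranks first. With a real transformer the gradient would come from backpropagation through the model rather than from a closed form.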

Abstract

Adversarial text attacks remain a persistent threat to transformer models, yet existing defenses are typically attack-specific or require costly model retraining, leaving a gap for attack-agnostic detection. We introduce Guided Perturbation Sensitivity (GPS), a detection framework that identifies adversarial examples by measuring how embedding representations change when important words are masked. GPS first ranks words using importance heuristics, then measures embedding sensitivity to masking top-k critical words, and processes the resulting patterns with a BiLSTM detector. Experiments show that adversarially perturbed words exhibit disproportionately high masking sensitivity compared to naturally important words. Across three datasets, three attack types, and two victim models, GPS achieves over 85% detection accuracy and demonstrates competitive performance compared to existing…
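
The rank, mask, and measure steps described in the abstract can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: `toy_embed` is a hypothetical stand-in for a transformer encoder, cosine distance is an assumed choice of sensitivity measure, masking one word at a time in rank order is an assumed protocol, and the BiLSTM detector that would consume the resulting trace is omitted.

```python
import numpy as np

MASK = "[MASK]"

def toy_embed(tokens, dim=16):
    """Hypothetical stand-in encoder: mean of deterministic per-token vectors."""
    vecs = []
    for tok in tokens:
        # Deterministic seed derived from the characters (avoids Python's salted hash()).
        seed = sum(ord(c) * (i + 1) for i, c in enumerate(tok))
        vecs.append(np.random.default_rng(seed).standard_normal(dim))
    return np.mean(vecs, axis=0)

def cosine_distance(a, b):
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def masking_sensitivity(tokens, importance, k=3, embed=toy_embed):
    """Mask each of the top-k important tokens in turn and record the embedding shift."""
    base = embed(tokens)
    k = min(k, len(tokens))
    top = np.argsort(-np.asarray(importance), kind="stable")[:k]
    trace = []
    for idx in top:
        masked = list(tokens)
        masked[idx] = MASK  # replace one important token with the mask symbol
        trace.append(cosine_distance(base, embed(masked)))
    return np.array(trace)

tokens = "the movie was terrible".split()
importance = [0.025, 0.1, 0.05, 0.525]  # illustrative scores from any ranking heuristic
trace = masking_sensitivity(tokens, importance, k=2)
```

The resulting `trace` is a length-k sequence of sensitivities ordered by word importance. In the pipeline the abstract describes, a sequence of this kind is what the BiLSTM detector would classify as adversarial or clean.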

Taxonomy

Topics: Power Transformer Diagnostics and Insulation