Alert-ME: An Explainability-Driven Defense Against Adversarial Examples in Transformer-Based Text Classification

Bushra Sabir (1); Yansong Gao (2); Alsharif Abuadbba (1); M. Ali Babar (3) ((1) CSIRO's Data61; (2) The University of Western Australia; (3) The University of Adelaide; CREST - The Centre for Research on Engineering Software Technologies)

arXiv:2307.01225·cs.CL·October 27, 2025

TL;DR

This paper introduces EDIT, a unified, explainability-driven framework that detects, identifies, and transforms adversarial inputs in transformer-based text classifiers, significantly improving robustness and interpretability against various attack types.

Contribution

The paper presents a novel framework combining explainability tools and frequency features for real-time adversarial detection and input transformation in NLP models, enhancing security and interpretability.

Findings

01. Achieves 89.69% F-score and 89.70% accuracy on multiple datasets.

02. Outperforms four state-of-the-art defenses in accuracy and speed.

03. Effectively defends against standard, zero-day, and adaptive attacks.

Abstract

Transformer-based text classifiers such as BERT, RoBERTa, T5, and GPT have shown strong performance in natural language processing tasks but remain vulnerable to adversarial examples. These vulnerabilities raise significant security concerns, as small input perturbations can cause severe misclassifications. Existing robustness methods often require heavy computation or lack interpretability. This paper presents a unified framework called Explainability-driven Detection, Identification, and Transformation (EDIT) to strengthen inference-time defenses. EDIT integrates explainability tools, including attention maps and integrated gradients, with frequency-based features to automatically detect and identify adversarial perturbations while offering insight into model behavior. After detection, EDIT refines adversarial inputs using an optimal transformation process that leverages pre-trained…
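
To make the explainability-plus-frequency idea in the abstract more concrete, here is a minimal sketch of integrated gradients on a toy bag-of-words logistic classifier, followed by a log-frequency lookup for the most influential token. This is not the EDIT implementation: the vocabulary, weights, corpus counts, and every function name are hypothetical, and a real transformer would need automatic differentiation over embeddings rather than the analytic gradient used here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b):
    """Toy logistic classifier over a bag-of-words vector x (assumed model)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def grad(x, w, b):
    """Analytic gradient of predict() with respect to the input x."""
    p = predict(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, b, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight path from the baseline to the input.
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        g = grad(point, w, b)
        total = [t + gi for t, gi in zip(total, g)]
    # Scale the averaged gradient by the input-baseline difference.
    return [(xi - bi) * t / steps for xi, bi, t in zip(x, baseline, total)]

# Hypothetical vocabulary, weights, and corpus counts, for illustration only.
vocab = ["good", "bad", "movie", "g00d"]
w = [2.0, -2.5, 0.1, -1.5]
b = 0.0
corpus_counts = {"good": 5000, "bad": 4000, "movie": 9000, "g00d": 2}

x = [0.0, 0.0, 1.0, 1.0]          # bag-of-words vector for "g00d movie"
baseline = [0.0] * len(vocab)     # all-zero (empty input) baseline

attributions = integrated_gradients(x, baseline, w, b)

# Token with the largest absolute attribution, and its log corpus frequency.
top_index = max(range(len(vocab)), key=lambda i: abs(attributions[i]))
top_token = vocab[top_index]
top_log_freq = math.log(1 + corpus_counts[top_token])
```

In this toy setup the attributions sum (approximately) to the difference between the prediction on the input and on the baseline, which is the completeness property of integrated gradients. A token that is both highly influential and rare in the corpus, like the perturbed spelling here, is the kind of signal a detector could use as a feature; how EDIT actually combines such signals is not specified in the text above.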

Taxonomy

Topics: Adversarial Robustness in Machine Learning

Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Linear Warmup With Linear Decay · Linear Layer · Adam · Linear Warmup With Cosine Annealing