FireBERT: Hardening BERT-based classifiers against adversarial attack
Gunnar Mein, Kevin Hartman, Andrew Morris

TL;DR
FireBERT introduces three methods to enhance BERT classifiers' robustness against adversarial word-perturbation attacks, maintaining high accuracy on both regular and adversarial samples through co-tuning and evaluation-time perturbations.
Contribution
The paper proposes novel techniques for hardening BERT-based classifiers against adversarial attacks, including co-tuning with synthetic data and evaluation-time perturbations, demonstrating significant robustness improvements.
Findings
Co-tuning with synthetic data protects against 95% of adversarial samples.
Evaluation-time perturbation restores up to 75% of original accuracy under attack.
Methods maintain high accuracy on regular benchmark samples.
Abstract
We present FireBERT, a set of three proof-of-concept NLP classifiers hardened against TextFooler-style word-perturbation by producing diverse alternatives to original samples. In one approach, we co-tune BERT against the training data and synthetic adversarial samples. In a second approach, we generate the synthetic samples at evaluation time through substitution of words and perturbation of embedding vectors. The diversified evaluation results are then combined by voting. A third approach replaces evaluation-time word substitution with perturbation of embedding vectors. We evaluate FireBERT for MNLI and IMDB Movie Review datasets, in the original and on adversarial examples generated by TextFooler. We also test whether TextFooler is less successful in creating new adversarial samples when manipulating FireBERT, compared to working on unhardened classifiers. We show that it is possible…
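The evaluation-time idea in the abstract (classify several perturbed variants of a sample, then combine the results by voting) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `vote_over_perturbations`, the Gaussian noise model, the parameters `n_copies` and `sigma`, and the toy linear classifier are all assumptions made for illustration; a real setup would perturb BERT's input embeddings and run the full model on each variant.

```python
import numpy as np


def vote_over_perturbations(embedding, classify, n_copies=8, sigma=0.05, seed=0):
    """Classify noisy copies of one embedded sample and return the majority label.

    embedding: array of shape (seq_len, dim), hypothetical token embeddings.
    classify:  hypothetical callable mapping such an array to an integer label.
    """
    rng = np.random.default_rng(seed)
    # The unperturbed sample gets a vote too, followed by n_copies noisy variants.
    variants = [embedding] + [
        embedding + rng.normal(0.0, sigma, size=embedding.shape)
        for _ in range(n_copies)
    ]
    votes = np.array([classify(v) for v in variants])
    # Majority vote; bincount + argmax resolves ties toward the lowest label.
    return int(np.bincount(votes).argmax())


# Toy stand-in for a classifier head (assumption): mean-pool the tokens,
# then apply a fixed 2x2 linear layer and take the argmax.
W = np.array([[1.0, -1.0], [-1.0, 1.0]])


def toy_classify(emb):
    return int((emb.mean(axis=0) @ W).argmax())


sample = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]])
label = vote_over_perturbations(sample, toy_classify)
```

The sketch only shows the voting mechanics. The intuition is that a single adversarial word swap that flips one prediction is less likely to flip the majority across many nearby variants. How the paper actually generates its variants and weighs the votes is not stated in the text above.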
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Network Security and Intrusion Detection
Methods: Linear Layer · WordPiece · Dense Connections · Linear Warmup With Linear Decay · Layer Normalization · Attention Is All You Need · Multi-Head Attention · Dropout · Adam
