MPAT: Building Robust Deep Neural Networks against Textual Adversarial Attacks
Fangyuan Zhang, Huichi Zhou, Shuangjiao Li, Hongtao Wang

TL;DR
This paper introduces MPAT, a novel adversarial training approach that enhances the robustness of deep neural networks against textual adversarial attacks without sacrificing task performance.
Contribution
The paper proposes a multi-level malicious perturbation generation strategy and a new training objective to improve defense effectiveness and maintain original task accuracy.
Findings
Outperforms previous defense methods against textual adversarial attacks
Maintains or improves original task performance
Effective across multiple models and datasets
Abstract
Deep neural networks have been proven to be vulnerable to adversarial examples and various methods have been proposed to defend against adversarial attacks for natural language processing tasks. However, previous defense methods have limitations in maintaining effective defense while ensuring the performance of the original task. In this paper, we propose a malicious perturbation based adversarial training method (MPAT) for building robust deep neural networks against textual adversarial attacks. Specifically, we construct a multi-level malicious example generation strategy to generate adversarial examples with malicious perturbations, which are used instead of original inputs for model training. Additionally, we employ a novel training objective function to ensure achieving the defense goal without compromising the performance on the original task. We conduct comprehensive experiments…
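The abstract names two ingredients: multi-level perturbation of the training inputs, and a training objective that keeps original-task performance intact. The toy sketch below illustrates that general shape only and is not the authors' implementation. The two perturbation levels (word and character), the `SYNONYMS` table, every function name, and the form of the loss (cross-entropy on the perturbed input plus a KL consistency term weighted by `weight`) are assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random

# Hypothetical synonym table; the paper's actual perturbation sources are not given here.
SYNONYMS = {"good": ["fine", "decent"], "movie": ["film"], "bad": ["poor"]}


def word_level_perturb(tokens, rng, rate=0.5):
    """Replace some tokens with a listed synonym (assumed word-level perturbation)."""
    out = []
    for tok in tokens:
        options = SYNONYMS.get(tok)
        if options and rng.random() < rate:
            out.append(rng.choice(options))
        else:
            out.append(tok)
    return out


def char_level_perturb(tokens, rng, rate=0.5):
    """Swap two adjacent inner characters in some tokens (assumed character-level perturbation)."""
    out = []
    for tok in tokens:
        if len(tok) > 3 and rng.random() < rate:
            i = rng.randrange(1, len(tok) - 2)
            tok = tok[:i] + tok[i + 1] + tok[i] + tok[i + 2:]
        out.append(tok)
    return out


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true label."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same labels."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def combined_loss(probs_clean, probs_perturbed, label, weight=1.0):
    """Task loss on the perturbed input plus a consistency term (assumed form)."""
    task = cross_entropy(probs_perturbed, label)
    consistency = kl_divergence(probs_clean, probs_perturbed)
    return task + weight * consistency


rng = random.Random(0)
tokens = ["good", "movie", "overall"]
perturbed = char_level_perturb(word_level_perturb(tokens, rng), rng)
loss = combined_loss([0.9, 0.1], [0.6, 0.4], label=0)
```

In this sketch the perturbed tokens stand in for the original input during training, as the abstract describes, while the consistency term is one plausible way to discourage predictions on perturbed text from drifting away from predictions on clean text.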
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection
