Unifying Model Explainability and Robustness for Joint Text Classification and Rationale Extraction
Dongfang Li, Baotian Hu, Qingcai Chen, Tujie Xu, Jingcong Tao, Yunan Zhang

TL;DR
This paper introduces AT-BMC, a joint model for text classification and rationale extraction that enhances robustness against adversarial attacks and improves explanation quality by combining adversarial training and boundary-guided rationale localization.
Contribution
It presents a novel joint model that unifies explainability and robustness, leveraging mixed adversarial training and boundary match constraints for improved performance.
Findings
Outperforms baselines in classification and rationale extraction.
Reduces attack success rate by up to 69%.
Shows a connection between robustness and better explanations.
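This summary does not say how the attack success rate is measured. As a rough illustration, the sketch below assumes a common definition: among examples the model classifies correctly before the attack, the fraction whose prediction changes after the attack. The function name `attack_success_rate` and the sample labels are hypothetical and are not taken from the paper.

```python
def attack_success_rate(labels, clean_preds, adv_preds):
    """Fraction of originally correct predictions that change under attack.

    Assumed definition for illustration; the paper's exact metric may differ.
    """
    # Only examples the model gets right before the attack count as attack targets.
    targets = [
        (clean, adv)
        for gold, clean, adv in zip(labels, clean_preds, adv_preds)
        if clean == gold
    ]
    if not targets:
        return 0.0
    flipped = sum(1 for clean, adv in targets if adv != clean)
    return flipped / len(targets)


# Hypothetical toy data: five examples, four of them correct before the attack.
labels = [1, 0, 1, 1, 0]
clean_preds = [1, 0, 1, 0, 0]
adv_preds = [0, 0, 1, 1, 1]
rate = attack_success_rate(labels, clean_preds, adv_preds)
```

Under this definition a lower rate means a more robust model, so a reduction in the rate is an improvement.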
Abstract
Recent works have shown explainability and robustness are two crucial ingredients of trustworthy and reliable text classification. However, previous works usually address one of two aspects: i) how to extract accurate rationales for explainability while being beneficial to prediction; ii) how to make the predictive model robust to different types of adversarial attacks. Intuitively, a model that produces helpful explanations should be more robust against adversarial attacks, because we cannot trust the model that outputs explanations but changes its prediction under small perturbations. To this end, we propose a joint classification and rationale extraction model named AT-BMC. It includes two key mechanisms: mixed Adversarial Training (AT) is designed to use various perturbations in discrete and embedding space to improve the model's robustness, and Boundary Match Constraint (BMC) helps…
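To make the idea of a perturbation in embedding space concrete, here is a minimal sketch of a generic gradient-based perturbation with an L2 norm budget. It is not the paper's implementation: the mean-pooled linear classifier, the names `loss_and_grad` and `perturb_embeddings`, and the budget `epsilon` are all assumptions chosen to keep the example small and self-contained.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss_and_grad(emb, w, b, y):
    """Binary cross-entropy of a toy mean-pooled linear classifier.

    emb: (num_tokens, dim) token embeddings; y: gold label in {0, 1}.
    Returns the loss and its gradient with respect to emb.
    """
    n = emb.shape[0]
    pooled = emb.mean(axis=0)
    p = sigmoid(pooled @ w + b)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    # d(loss)/d(logit) = p - y, and d(logit)/d(emb[i]) = w / n for every token i.
    grad = np.tile((p - y) * w / n, (n, 1))
    return loss, grad


def perturb_embeddings(emb, w, b, y, epsilon):
    """Move embeddings along the loss gradient, scaled to L2 norm epsilon."""
    _, grad = loss_and_grad(emb, w, b, y)
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return emb.copy()
    return emb + epsilon * grad / norm


# Hypothetical toy input: 5 tokens with 4-dimensional embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 4))
w = rng.normal(size=4)
b, y, epsilon = 0.1, 1, 0.5

clean_loss, _ = loss_and_grad(emb, w, b, y)
adv_emb = perturb_embeddings(emb, w, b, y, epsilon)
adv_loss, _ = loss_and_grad(adv_emb, w, b, y)
```

In adversarial training, the loss on perturbed inputs such as `adv_emb` is typically added to the loss on clean inputs, so the model learns to keep its prediction stable under small input changes.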
Taxonomy
Topics: Adversarial Robustness in Machine Learning
