Adversarial Attacks and Defense for Conversation Entailment Task
Zhenning Yang, Ryan Krawec, Liang-Yuan Wu

TL;DR
This paper investigates adversarial attacks on conversation entailment models and proposes novel fine-tuning and embedding perturbation techniques to improve their robustness against such attacks.
Contribution
It introduces new defense methods, including embedding perturbation loss, to enhance transformer models' robustness in conversation entailment tasks against adversarial manipulations.
Findings
Fine-tuning techniques improve model accuracy under attack
Embedding perturbation loss enhances robustness significantly
Defense methods have real-world implications for reliable NLP applications
Abstract
As the deployment of NLP systems in critical applications grows, ensuring the robustness of large language models (LLMs) against adversarial attacks becomes increasingly important. Large language models excel in various NLP tasks but remain vulnerable to low-cost adversarial attacks. Focusing on the domain of conversation entailment, where multi-turn dialogues serve as premises to verify hypotheses, we fine-tune a transformer model to accurately discern the truthfulness of these hypotheses. Adversaries manipulate hypotheses through synonym swapping, aiming to deceive the model into making incorrect predictions. To counteract these attacks, we implemented innovative fine-tuning techniques and introduced an embedding perturbation loss method to significantly bolster the model's robustness. Our findings not only emphasize the importance of defending against adversarial attacks in NLP but…
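The synonym-swapping attack described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `SYNONYMS` table, the word-overlap `toy_entailment_score` (a stand-in for the fine-tuned transformer, not the authors' model), the greedy search order, and the 0.7 decision threshold. The abstract does not specify how the authors choose which words to swap.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical synonym table; a real attack would draw on a lexical resource.
SYNONYMS: Dict[str, List[str]] = {
    "happy": ["glad", "content"],
    "car": ["automobile", "vehicle"],
}


def toy_entailment_score(premise: List[str], hypothesis: str) -> float:
    """Stand-in scorer: fraction of hypothesis words that appear in the dialogue."""
    dialogue_words = set(" ".join(premise).lower().split())
    words = hypothesis.lower().split()
    return sum(w in dialogue_words for w in words) / len(words)


def synonym_swap_attack(
    premise: List[str],
    hypothesis: str,
    score_fn: Callable[[List[str], str], float],
    threshold: float = 0.7,
) -> Tuple[str, float]:
    """Greedily replace words with synonyms that lower the entailment score.

    Stops early once the score falls below `threshold`, i.e. once the
    toy model's prediction flips from "entailed" to "not entailed".
    """
    words = hypothesis.split()
    best = score_fn(premise, hypothesis)
    for i, word in enumerate(words):
        for candidate in SYNONYMS.get(word.lower(), []):
            trial = words[:i] + [candidate] + words[i + 1:]
            trial_score = score_fn(premise, " ".join(trial))
            if trial_score < best:  # keep only swaps that hurt the model
                words, best = trial, trial_score
        if best < threshold:
            break
    return " ".join(words), best


# Multi-turn dialogue as the premise, one hypothesis to verify.
premise = ["i love my car", "it makes me happy"]
hypothesis = "my car makes me happy"
adversarial, score = synonym_swap_attack(premise, hypothesis, toy_entailment_score)
print(adversarial, score)
```

The swapped hypothesis keeps its meaning for a human reader, yet the toy scorer's confidence drops below the threshold. That gap between preserved meaning and changed prediction is the weakness the defenses in this paper aim to close.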
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection
