Adversarial Attacks and Defense for Conversation Entailment Task

Zhenning Yang; Ryan Krawec; Liang-Yuan Wu

arXiv:2405.00289 · cs.CL · May 3, 2024 · 1 citation

Open Access

TL;DR

This paper investigates adversarial attacks on conversation entailment models and proposes novel fine-tuning and embedding perturbation techniques to improve their robustness against such attacks.

Contribution

It introduces new defense methods, including embedding perturbation loss, to enhance transformer models' robustness in conversation entailment tasks against adversarial manipulations.
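The summary above does not spell out how the embedding perturbation loss is defined, so the following is only a minimal sketch under stated assumptions: a toy logistic classifier over a fixed-size embedding, uniform random noise bounded by a hypothetical `epsilon`, and a hypothetical weight `lam` on the perturbed term. It illustrates the general idea of penalizing a model when small changes to its input embedding degrade the prediction; it is not the authors' formulation.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(prob, label):
    """Binary cross-entropy for a single example (label is 0 or 1)."""
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1 - prob + eps))


def predict_prob(embedding, weights, bias):
    """Toy stand-in for a classifier head: logistic regression on the embedding."""
    return sigmoid(sum(w * x for w, x in zip(weights, embedding)) + bias)


def perturbation_loss(embedding, label, weights, bias, epsilon=0.1, lam=1.0, rng=None):
    """Clean loss plus lam times the loss on a randomly perturbed embedding.

    ASSUMPTION: uniform noise in [-epsilon, epsilon] per dimension and an
    additive penalty; the paper's actual loss may differ.
    """
    rng = rng or random.Random(0)
    noisy = [x + rng.uniform(-epsilon, epsilon) for x in embedding]
    clean = bce(predict_prob(embedding, weights, bias), label)
    perturbed = bce(predict_prob(noisy, weights, bias), label)
    return clean + lam * perturbed


if __name__ == "__main__":
    emb = [0.5, -1.0, 0.25]       # hypothetical sentence embedding
    w, b = [1.0, 0.5, -2.0], 0.1  # hypothetical classifier parameters
    print(perturbation_loss(emb, 1, w, b, epsilon=0.1, lam=1.0))
```

With `lam=0.0` the function reduces to the ordinary cross-entropy on the clean embedding, which makes it easy to compare training with and without the extra term.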

Findings

01. Fine-tuning techniques improve model accuracy under attack

02. Embedding perturbation loss enhances robustness significantly

03. Defense methods have real-world implications for reliable NLP applications

Abstract

As the deployment of NLP systems in critical applications grows, ensuring the robustness of large language models (LLMs) against adversarial attacks becomes increasingly important. LLMs excel at a wide range of NLP tasks but remain vulnerable to low-cost adversarial attacks. Focusing on conversation entailment, where multi-turn dialogues serve as premises against which hypotheses are verified, we fine-tune a transformer model to judge whether each hypothesis is true. Adversaries manipulate hypotheses through synonym swapping, aiming to deceive the model into making incorrect predictions. To counteract these attacks, we implement innovative fine-tuning techniques and introduce an embedding perturbation loss method to significantly bolster the model's robustness. Our findings not only emphasize the importance of defending against adversarial attacks in NLP but…
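The synonym-swapping attack described in the abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `SYNONYMS` table is a hypothetical toy dictionary, `toy_predict` is a deliberately brittle stand-in for a fine-tuned entailment model, and the search tries only single-word substitutions. The paper's actual attack procedure and model are not given in this summary.

```python
# Hypothetical toy synonym table; a real attack would use a thesaurus
# or nearest neighbours in an embedding space.
SYNONYMS = {
    "wants": ["desires", "wishes"],
    "buy": ["purchase"],
    "car": ["automobile", "vehicle"],
}


def single_swap_variants(hypothesis, synonyms=SYNONYMS):
    """Return every hypothesis obtained by replacing exactly one word with a synonym."""
    tokens = hypothesis.split()
    variants = []
    for i, tok in enumerate(tokens):
        for alt in synonyms.get(tok.lower(), []):
            variants.append(" ".join(tokens[:i] + [alt] + tokens[i + 1:]))
    return variants


def attack(predict, premise, hypothesis, synonyms=SYNONYMS):
    """Return the first single-swap variant that flips the prediction, or None."""
    original = predict(premise, hypothesis)
    for candidate in single_swap_variants(hypothesis, synonyms):
        if predict(premise, candidate) != original:
            return candidate
    return None


def toy_predict(premise, hypothesis):
    """Stand-in model (ASSUMPTION): entailed only if every hypothesis word occurs in the premise."""
    premise_words = set(premise.lower().split())
    return all(word in premise_words for word in hypothesis.lower().split())


if __name__ == "__main__":
    premise = "a: i think bob wants to buy a car"
    hypothesis = "bob wants to buy a car"
    print(toy_predict(premise, hypothesis))          # prediction on the original hypothesis
    print(attack(toy_predict, premise, hypothesis))  # adversarial variant, if one is found
```

Because the swapped hypothesis keeps its meaning while the stand-in model's prediction flips, the example shows why such attacks are considered low-cost: no access to model internals is needed, only repeated queries.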

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection