Generating Label Cohesive and Well-Formed Adversarial Claims
Pepa Atanasova, Dustin Wright, and Isabelle Augenstein

TL;DR
This paper develops a method for generating adversarial claims against fact checking systems that preserve the ground truth meaning of the original claim and remain semantically valid, making the attacks more realistic and more effective.
Contribution
It introduces a joint optimization approach and a conditional language model to produce more semantically coherent adversarial claims for fact checking.
Findings
Generated attacks better preserve claim semantics.
Attacks more reliably maintain directionality, i.e. they are less likely to invert the meaning of the claim they modify.
Adversarial attack quality improves over previous methods.
Abstract
Adversarial attacks reveal important vulnerabilities and flaws of trained models. One potent type of attack is universal adversarial triggers, which are individual n-grams that, when appended to instances of a class under attack, can trick a model into predicting a target class. However, for inference tasks such as fact checking, these triggers often inadvertently invert the meaning of the instances they are inserted into. In addition, such attacks produce semantically nonsensical inputs, as they simply concatenate triggers to existing samples. Here, we investigate how to generate adversarial attacks against fact checking systems that preserve the ground truth meaning and are semantically valid. We extend the HotFlip attack algorithm used for universal trigger generation by jointly minimising the target class loss of a fact checking model and the entailment class loss of an auxiliary natural…
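The joint objective mentioned in the abstract can be illustrated with a minimal sketch of a HotFlip-style candidate search. This is not the authors' implementation: the function name `hotflip_candidates`, the weighting factor `aux_weight`, and the toy embeddings and gradients are all assumptions made for illustration. The sketch only shows how summing the gradients of two losses changes which replacement token a first-order search prefers.

```python
import numpy as np

def hotflip_candidates(embeddings, current_id, grad_fc, grad_aux,
                       aux_weight=1.0, k=3):
    """Rank replacement tokens for one trigger position (hypothetical helper).

    embeddings : (vocab, dim) token embedding matrix
    current_id : index of the token currently at the trigger position
    grad_fc    : gradient of the fact checking target-class loss
                 w.r.t. the current trigger token embedding
    grad_aux   : gradient of the auxiliary model's entailment-class loss
                 w.r.t. the same embedding
    aux_weight : assumed scalar weight on the auxiliary loss
    """
    # Gradient of the joint loss: L = L_fc + aux_weight * L_aux.
    joint_grad = grad_fc + aux_weight * grad_aux
    # First-order estimate of the loss change when the current token is
    # swapped for each vocabulary token: (e_i - e_cur) . grad
    delta = (embeddings - embeddings[current_id]) @ joint_grad
    # Both losses are minimised, so the most negative change ranks first.
    return np.argsort(delta)[:k]

# Toy data (made up): 4 tokens with 2-dimensional embeddings.
E = np.array([[0.0, 0.0], [-2.0, 1.0], [-1.0, -1.0], [1.0, 1.0]])
g_fc = np.array([1.0, 0.0])
g_aux = np.array([0.0, 1.0])

best_fc_only = hotflip_candidates(E, 0, g_fc, g_aux, aux_weight=0.0, k=1)
best_joint = hotflip_candidates(E, 0, g_fc, g_aux, aux_weight=3.0, k=1)
print(best_fc_only, best_joint)
```

In this toy setup the token preferred by the fact checking loss alone differs from the token preferred once the auxiliary term is weighted in, which is the intended effect of optimising both losses together.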
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
