Attacking Misinformation Detection Using Adversarial Examples Generated by Language Models
Piotr Przybyła, Euan McGill, Horacio Saggion

TL;DR
This paper presents TREPAT, a method using large language models to generate adversarial examples that test the robustness of social media misinformation detection algorithms under realistic query constraints.
Contribution
We introduce TREPAT, a novel approach that combines LLM-generated rephrasings with beam search to attack content moderation classifiers under a limited query budget.
Findings
TREPAT outperforms baseline methods in query-constrained scenarios
Long news articles are more vulnerable to adversarial attacks
The approach is effective across various models and prompts
Abstract
Large language models have many beneficial applications, but can they also be used to attack content-filtering algorithms in social media platforms? We investigate the challenge of generating adversarial examples to test the robustness of text classification algorithms detecting low-credibility content, including propaganda, false claims, rumours and hyperpartisan news. We focus on simulation of content moderation by setting realistic limits on the number of queries an attacker is allowed to attempt. Within our solution (TREPAT), initial rephrasings are generated by large language models with prompts inspired by meaning-preserving NLP tasks, such as text simplification and style transfer. Subsequently, these modifications are decomposed into small changes, applied through a beam search procedure, until the victim classifier changes its decision. We perform (1) quantitative evaluation…
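The decompose-then-search step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the word-level diffing via `difflib`, the function names (`decompose`, `apply_edits`, `beam_attack`), the keyword-counting `toy_victim`, the 0.5 decision threshold, and the default beam width are all assumptions made for illustration. A single hard-coded rephrasing stands in for the LLM output.

```python
from difflib import SequenceMatcher


def decompose(original, rephrased):
    """Split the difference between two token lists into small, non-overlapping edits."""
    ops = SequenceMatcher(a=original, b=rephrased, autojunk=False).get_opcodes()
    # Each edit is (start, end, replacement tokens) in original-token coordinates.
    return [(i1, i2, tuple(rephrased[j1:j2]))
            for tag, i1, i2, j1, j2 in ops if tag != "equal"]


def apply_edits(original, edits):
    """Apply edits right to left so earlier indices stay valid."""
    tokens = list(original)
    for i1, i2, repl in sorted(edits, key=lambda e: e[0], reverse=True):
        tokens[i1:i2] = repl
    return tokens


def beam_attack(text, rephrased, victim, beam_width=2, max_queries=20):
    """Add one small edit at a time, keeping the best partial rewrites,
    until the victim's score drops below 0.5 or the query budget runs out."""
    original = text.split()
    edits = decompose(original, rephrased.split())
    queries = 0
    beam = [frozenset()]          # each state: indices of the edits applied so far
    seen = set(beam)
    while beam and queries < max_queries:
        scored = []
        for state in beam:
            for k in range(len(edits)):
                if queries >= max_queries:
                    break         # budget exhausted: stop querying the victim
                nxt = state | {k}
                if k in state or nxt in seen:
                    continue
                seen.add(nxt)
                candidate = " ".join(apply_edits(original, [edits[i] for i in nxt]))
                score = victim(candidate)
                queries += 1
                if score < 0.5:   # assumed threshold: the decision has flipped
                    return candidate, queries
                scored.append((score, nxt))
        # Keep the states that lower the victim's score the most.
        beam = [s for _, s in sorted(scored, key=lambda p: p[0])[:beam_width]]
    return None, queries


def toy_victim(text):
    """Hypothetical stand-in classifier: score for the 'low-credibility' class."""
    triggers = {"shocking", "secret", "miracle"}
    hits = sum(w.strip(".,!").lower() in triggers for w in text.split())
    return min(1.0, 0.3 * hits)


original_text = "Shocking news about a secret miracle cure"
rephrased_text = "Surprising news about a hidden miracle cure"  # stand-in for an LLM rephrasing
adversarial, used = beam_attack(original_text, rephrased_text, toy_victim)
print(adversarial, used)
```

Counting every call to the victim against `max_queries` mirrors the query limit the abstract emphasises: the search returns `None` once the budget is spent rather than continuing to probe the classifier.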
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Misinformation and Its Impacts · Topic Modeling
