Attacking Misinformation Detection Using Adversarial Examples Generated by Language Models

Piotr Przybyła; Euan McGill; Horacio Saggion

arXiv:2410.20940 · cs.CL · September 4, 2025

TL;DR

This paper presents TREPAT, a method using large language models to generate adversarial examples that test the robustness of social media misinformation detection algorithms under realistic query constraints.

Contribution

We introduce TREPAT, a novel approach that combines LLM-generated rephrasings, prompted in the style of meaning-preserving NLP tasks, with beam search to attack content moderation classifiers under a limited query budget.

Findings

01. TREPAT outperforms baseline methods in constrained scenarios

02. Long news articles are more vulnerable to adversarial attacks

03. The approach is effective across various models and prompts

Abstract

Large language models have many beneficial applications, but can they also be used to attack content-filtering algorithms on social media platforms? We investigate the challenge of generating adversarial examples to test the robustness of text classification algorithms that detect low-credibility content, including propaganda, false claims, rumours and hyperpartisan news. We focus on simulating content moderation by setting realistic limits on the number of queries an attacker is allowed to attempt. Within our solution (TREPAT), initial rephrasings are generated by large language models with prompts inspired by meaning-preserving NLP tasks, such as text simplification and style transfer. Subsequently, these modifications are decomposed into small changes, which are applied through a beam search procedure until the victim classifier changes its decision. We perform (1) quantitative evaluation…
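The pipeline described in the abstract (rephrase, decompose into small changes, apply them through beam search under a query limit) can be illustrated with a minimal sketch. This is not the authors' implementation: the word-level diff via `difflib`, the 0.5 decision threshold, the single hard-coded rephrasing standing in for LLM output, and every name below (`decompose`, `apply_changes`, `beam_search_attack`, `toy_victim`) are assumptions made for illustration only.

```python
import difflib
from typing import Callable, List, Optional, Tuple

# A change replaces original tokens [start:end) with new tokens.
Change = Tuple[int, int, List[str]]


def decompose(original: str, rephrased: str) -> List[Change]:
    """Split one rephrasing into small word-level changes (assumed granularity)."""
    a, b = original.split(), rephrased.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(i1, i2, b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]


def apply_changes(original: str, changes: List[Change]) -> str:
    """Apply non-overlapping changes right to left so earlier indices stay valid."""
    tokens = original.split()
    for start, end, new in sorted(changes, key=lambda c: (c[0], c[1]), reverse=True):
        tokens[start:end] = new
    return " ".join(tokens)


def beam_search_attack(original: str, rephrased: str,
                       victim: Callable[[str], float],
                       beam_width: int = 2,
                       max_queries: int = 20) -> Tuple[Optional[str], int]:
    """Return (adversarial_text, queries_used), or (None, queries_used) on failure."""
    changes = decompose(original, rephrased)
    queries = 0
    beam = [frozenset()]          # each state is a set of indices into `changes`
    seen = set(beam)
    while beam:
        scored = []
        for state in beam:
            for idx in range(len(changes)):
                new_state = state | {idx}
                if idx in state or new_state in seen:
                    continue
                seen.add(new_state)
                if queries >= max_queries:    # query budget exhausted
                    return None, queries
                text = apply_changes(original, [changes[i] for i in new_state])
                score = victim(text)          # one query to the victim classifier
                queries += 1
                if score < 0.5:               # assumed threshold: decision flipped
                    return text, queries
                scored.append((score, new_state))
        scored.sort(key=lambda item: item[0])  # lower score = closer to flipping
        beam = [state for _, state in scored[:beam_width]]
    return None, queries


def toy_victim(text: str) -> float:
    """Hypothetical stand-in classifier: score rises with 'alarmist' words."""
    triggers = {"shocking", "secret", "miracle"}
    hits = sum(w.lower().strip(".,!") in triggers for w in text.split())
    return min(1.0, 0.4 * hits)


original = "Shocking report reveals secret miracle cure doctors hide"
# In the paper's setting an LLM would produce this; here it is hard-coded.
rephrased = "Surprising report reveals hidden new cure doctors hide"

adversarial, used = beam_search_attack(original, rephrased, toy_victim)
print(adversarial, used)
```

The sketch stops as soon as any partial application of the rephrasing flips the toy classifier, so the returned text may keep most of the original wording. Counting each call to `victim` against `max_queries` mirrors the constrained-query setting the abstract emphasises.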

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Misinformation and Its Impacts · Topic Modeling
