On the Risk of Evidence Pollution for Malicious Social Text Detection in the Era of LLMs
Herun Wan, Minnan Luo, Zhixiong Su, Guang Dai, Xiang Zhao

TL;DR
This paper investigates how LLM-driven evidence pollution can undermine evidence-enhanced malicious social text detectors. It measures the resulting performance drops and proposes strategies to mitigate the risk.
Contribution
It identifies the threat of evidence pollution from LLMs to social text detection and evaluates mitigation strategies, providing insights into their limitations and impact.
Findings
Evidence pollution significantly degrades detectors; LLM-generated evidence causes up to a 14.4% performance drop.
Polluted evidence is of high quality and affects model calibration.
Mitigation strategies can reduce impact but face practical limitations.
Abstract
Evidence-enhanced detectors show remarkable ability to identify malicious social text. However, the rise of large language models (LLMs) introduces the risk of evidence pollution that can confuse these detectors. This paper explores potential manipulation scenarios, including basic pollution and rephrasing or generating evidence with LLMs. To mitigate the negative impact, we propose three defense strategies from the data and model sides: machine-generated text detection, a mixture of experts, and parameter updating. Extensive experiments on four malicious social text detection tasks with ten datasets show that evidence pollution significantly compromises detectors, with the generating strategy causing up to a 14.4% performance drop. Meanwhile, the defense strategies can mitigate evidence pollution, but they face limitations in practical deployment. Further analysis…
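The abstract lists machine-generated text (MGT) detection as a data-side defense but does not say how it is wired into the detector. A minimal sketch, assuming the defense works by scoring each evidence item with an MGT detector, dropping items above a threshold, and concatenating the survivors with the social text. All names here (`filter_evidence`, `build_detector_input`, `toy_mgt_score`), the 0.5 threshold, and the `[SEP]` separator are hypothetical, and `toy_mgt_score` is a keyword stand-in, not a real MGT detector or the authors' method.

```python
from typing import Callable, List, Tuple

# Hypothetical scorer type: maps a text to an estimated probability
# that it is machine-generated (0.0 = human, 1.0 = machine).
MGTScorer = Callable[[str], float]


def filter_evidence(
    evidence: List[str], mgt_score: MGTScorer, threshold: float = 0.5
) -> Tuple[List[str], List[str]]:
    """Split evidence into (kept, dropped).

    Items whose score is >= threshold are dropped as likely machine-generated.
    """
    kept: List[str] = []
    dropped: List[str] = []
    for item in evidence:
        if mgt_score(item) >= threshold:
            dropped.append(item)
        else:
            kept.append(item)
    return kept, dropped


def build_detector_input(post: str, evidence: List[str], sep: str = " [SEP] ") -> str:
    """Concatenate the social text with the surviving evidence.

    If no evidence survives, the detector input is just the post itself.
    """
    return sep.join([post, *evidence])


def toy_mgt_score(text: str) -> float:
    """Hypothetical stand-in for a real MGT detector: flags one stock LLM phrase."""
    return 0.9 if "as an ai language model" in text.lower() else 0.1


if __name__ == "__main__":
    evidence = [
        "Fact-checkers found no evidence for this claim.",
        "As an AI language model, I can confirm the claim is accurate.",
    ]
    kept, dropped = filter_evidence(evidence, toy_mgt_score)
    # Prints: Vaccines contain microchips. [SEP] Fact-checkers found no evidence for this claim.
    print(build_detector_input("Vaccines contain microchips.", kept))
```

The threshold is the main design knob in this sketch: lowering it removes more polluted evidence, but any false positive from the MGT detector also silently discards legitimate evidence before the downstream detector ever sees it.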
Taxonomy
Topics: Spam and Phishing Detection · Digital and Cyber Forensics
