Adversarially Robust Detection of Harmful Online Content: A Computational Design Science Approach
Yidong Chai, Yi Liu, Mohammadreza Ebrahimi, Weifeng Li, Balaji Padmanabhan

TL;DR
This paper introduces a framework (LLM-SGA) and a detector (ARHOCD) for identifying harmful online content that remain robust under adversarial attacks, combining adversarial training strategies with ensemble methods to improve both generalizability and accuracy.
Contribution
It proposes the LLM-SGA framework and the ARHOCD detector, which integrate novel ensemble, weighting, and adversarial training techniques to make harmful content detection more robust to adversarial attacks.
Findings
ARHOCD outperforms existing methods in robustness and accuracy
Demonstrates effectiveness across hate speech, rumor, and extremist datasets
Achieves strong generalizability against diverse adversarial attacks
Abstract
Social media platforms are plagued by harmful content such as hate speech, misinformation, and extremist rhetoric. Machine learning (ML) models are widely adopted to detect such content; however, they remain highly vulnerable to adversarial attacks, wherein malicious users subtly modify text to evade detection. Enhancing adversarial robustness is therefore essential, requiring detectors that can defend against diverse attacks (generalizability) while maintaining high overall accuracy. However, simultaneously achieving both optimal generalizability and accuracy is challenging. Following the computational design science paradigm, this study takes a sequential approach that first proposes a novel framework (Large Language Model-based Sample Generation and Aggregation, LLM-SGA) by identifying the key invariances of textual adversarial attacks and leveraging them to ensure that a detector…
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Misinformation and Its Impacts · Adversarial Robustness in Machine Learning
