Adversarially Robust Detection of Harmful Online Content: A Computational Design Science Approach

Yidong Chai; Yi Liu; Mohammadreza Ebrahimi; Weifeng Li; Balaji Padmanabhan

arXiv:2512.17367 · cs.LG · December 30, 2025

Open Access

TL;DR

This paper introduces a framework (LLM-SGA) and a detector (ARHOCD) for identifying harmful online content that remain robust to adversarial attacks. It combines ensemble methods, weighting, and adversarial training to improve both generalizability across attacks and overall accuracy.

Contribution

It proposes the LLM-SGA framework and the ARHOCD detector, which integrate ensemble, weighting, and adversarial training techniques to make harmful content detection more robust to adversarial attacks.

Findings

01. ARHOCD outperforms existing methods in robustness and accuracy

02. Demonstrates effectiveness across hate speech, rumor, and extremist datasets

03. Achieves strong generalizability against diverse adversarial attacks

Abstract

Social media platforms are plagued by harmful content such as hate speech, misinformation, and extremist rhetoric. Machine learning (ML) models are widely adopted to detect such content; however, they remain highly vulnerable to adversarial attacks, wherein malicious users subtly modify text to evade detection. Enhancing adversarial robustness is therefore essential, requiring detectors that can defend against diverse attacks (generalizability) while maintaining high overall accuracy. However, simultaneously achieving both optimal generalizability and accuracy is challenging. Following the computational design science paradigm, this study takes a sequential approach that first proposes a novel framework (Large Language Model-based Sample Generation and Aggregation, LLM-SGA) by identifying the key invariances of textual adversarial attacks and leveraging them to ensure that a detector…

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Misinformation and Its Impacts · Adversarial Robustness in Machine Learning