The Content Moderator's Dilemma: Removal of Toxic Content and Distortions to Online Discourse
Mahyar Habibi, Dirk Hovy, Carlo Schwarz

TL;DR
This paper introduces a methodology for measuring how content moderation distorts online discourse, shows that removing toxic Tweets alters the semantic composition of the remaining content, and proposes rephrasing toxic Tweets with large language models to mitigate these effects.
Contribution
It presents a novel method for quantifying moderation-induced distortions and proposes an LLM-based rephrasing strategy to preserve content while reducing toxicity.
Findings
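Removing toxic Tweets alters the semantic composition of content, consistently across embedding models, toxicity metrics, and samples.
The distortion is not driven solely by toxic language itself but by the removal of topics that are often expressed in toxic form.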
Rephrasing with LLMs reduces toxicity and minimizes content distortion.
Abstract
There is an ongoing debate about how to moderate toxic speech on social media and the impact of content moderation on online discourse. This paper proposes and validates a methodology for measuring the content-moderation-induced distortions in online discourse using text embeddings from computational linguistics. Applying the method to a representative sample of 5 million US political Tweets, we find that removing toxic Tweets alters the semantic composition of content. This finding is consistent across different embedding models, toxicity metrics, and samples. Importantly, we demonstrate that these effects are not solely driven by toxic language but by the removal of topics often expressed in toxic form. We propose an alternative approach to content moderation that uses generative Large Language Models to rephrase toxic Tweets, preserving their salvageable content rather than removing…
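A minimal sketch of the kind of embedding-based comparison the abstract describes may help make the idea concrete. It assumes each Tweet already has an embedding vector and a toxicity score in [0, 1]. The function names, the 0.5 threshold, the cosine distance between mean embeddings, and the random-removal baseline are all illustrative assumptions, not the paper's actual metric or pipeline.

```python
import numpy as np


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    return 1.0 - float(u @ v) / float(np.linalg.norm(u) * np.linalg.norm(v))


def moderation_distortion(embeddings, toxicity, threshold=0.5):
    """Distance between the mean embedding of all texts and the mean
    embedding of the texts kept after removing those at or above threshold.
    (Hypothetical measure chosen for illustration.)"""
    embeddings = np.asarray(embeddings, dtype=float)
    keep = np.asarray(toxicity, dtype=float) < threshold
    if not keep.any():
        raise ValueError("threshold removes every text")
    return cosine_distance(embeddings.mean(axis=0), embeddings[keep].mean(axis=0))


def random_removal_baseline(embeddings, n_remove, n_draws=200, seed=0):
    """Average distortion when the same number of texts is removed at random.
    (Hypothetical baseline chosen for illustration.)"""
    embeddings = np.asarray(embeddings, dtype=float)
    if n_remove >= len(embeddings):
        raise ValueError("n_remove must leave at least one text")
    rng = np.random.default_rng(seed)
    full_mean = embeddings.mean(axis=0)
    distances = []
    for _ in range(n_draws):
        keep = np.ones(len(embeddings), dtype=bool)
        keep[rng.choice(len(embeddings), size=n_remove, replace=False)] = False
        distances.append(cosine_distance(full_mean, embeddings[keep].mean(axis=0)))
    return float(np.mean(distances))


# Toy data (made up): two "topics" in a 2-D embedding space.
# Most texts on the second topic happen to be scored as toxic.
toy_embeddings = np.array([[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4)
toy_toxicity = np.array([0.1] * 4 + [0.9, 0.9, 0.9, 0.1])

observed = moderation_distortion(toy_embeddings, toy_toxicity)
baseline = random_removal_baseline(toy_embeddings, n_remove=3)
print(f"toxic removal: {observed:.3f}, random removal: {baseline:.3f}")
```

In this toy setup, removing the toxic texts shifts the average embedding toward the first topic by more than removing the same number of texts at random does, which is the pattern the abstract attributes to topics that are often expressed in toxic form.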
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Freedom of Expression and Defamation
