The Constant in HATE: Analyzing Toxicity in Reddit across Topics and Languages
Wondimagegnhue Tsegaye Tufa, Ilia Markov, Piek Vossen

TL;DR
This study analyzes toxicity in Reddit comments across multiple languages and topics, revealing patterns of increased toxicity related to specific subjects and notable variations among language communities.
Contribution
It provides a comprehensive cross-lingual, cross-topic analysis of toxicity patterns on Reddit using a large multilingual dataset.
Findings
Toxicity spikes vary by topic and language
Certain topics consistently show higher toxicity levels
Significant variation within specific language communities
Abstract
Toxic language remains an ongoing challenge on social media platforms, presenting significant issues for users and communities. This paper provides a cross-topic and cross-lingual analysis of toxicity in Reddit conversations. We collect 1.5 million comment threads from 481 communities in six languages: English, German, Spanish, Turkish, Arabic, and Dutch, covering 80 topics such as Culture, Politics, and News. We thoroughly analyze how toxicity spikes within different communities in relation to specific topics. We observe consistent patterns of increased toxicity across languages for certain topics, while also noting significant variations within specific language communities.
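To make the kind of comparison described in the abstract concrete, the following is a minimal sketch of how toxicity rates could be aggregated per language and topic and then checked for topics that are elevated in every language. It is not the authors' pipeline: the record fields (`language`, `topic`, `toxicity`), the 0.5 threshold, and both function names are assumptions chosen for illustration, and the toxicity scores are presumed to come from some external classifier.

```python
from collections import defaultdict


def toxicity_rate_by_group(comments, threshold=0.5):
    """Share of toxic comments per (language, topic) pair.

    `comments` is a list of dicts with hypothetical keys 'language',
    'topic', and 'toxicity' (a score in [0, 1] from any classifier).
    The threshold is an illustrative assumption, not a value from the paper.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [toxic, total]
    for c in comments:
        key = (c["language"], c["topic"])
        counts[key][1] += 1
        if c["toxicity"] >= threshold:
            counts[key][0] += 1
    return {key: toxic / total for key, (toxic, total) in counts.items()}


def consistently_elevated_topics(comments, threshold=0.5):
    """Topics whose toxicity rate exceeds the language-wide rate
    in every language where the topic appears."""
    rates = toxicity_rate_by_group(comments, threshold)

    # Per-language baseline: toxic share over all topics in that language.
    lang_counts = defaultdict(lambda: [0, 0])
    for c in comments:
        lang_counts[c["language"]][1] += 1
        if c["toxicity"] >= threshold:
            lang_counts[c["language"]][0] += 1
    baseline = {lang: t / n for lang, (t, n) in lang_counts.items()}

    # For each topic, record whether it beats the baseline in each language.
    flags_by_topic = defaultdict(list)
    for (lang, topic), rate in rates.items():
        flags_by_topic[topic].append(rate > baseline[lang])
    return sorted(t for t, flags in flags_by_topic.items() if all(flags))
```

Comparing each topic against its own language's baseline, rather than against a global rate, keeps languages with generally higher or lower measured toxicity from dominating the result.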
Taxonomy
Topics: Spam and Phishing Detection · Hate Speech and Cyberbullying Detection · Sentiment Analysis and Opinion Mining
