ModelCitizens: Representing Community Voices in Online Safety
Ashima Suvarna, Christina Chance, Karolina Naranjo, Hamid Palangi, Sophie Hao, Thomas Hartvigsen, Saadia Gabriel

TL;DR
This paper introduces MODELCITIZENS, a community-informed dataset for toxicity detection on social media, highlighting the importance of diverse perspectives and context in improving moderation tools.
Contribution
It presents a new dataset with community-specific annotations, augmented conversational context, and fine-tuned models that outperform existing toxicity detection tools.
Findings
State-of-the-art tools underperform on community-informed data
Detection performance degrades further on context-augmented posts
Fine-tuned models outperform GPT-o4-mini by 5.5%
Abstract
Automatic toxic language detection is critical for creating safe, inclusive online spaces. However, it is a highly subjective task, with perceptions of toxic language shaped by community norms and lived experience. Existing toxicity detection models are typically trained on annotations that collapse diverse annotator perspectives into a single ground truth, erasing important context-specific notions of toxicity such as reclaimed language. To address this, we introduce MODELCITIZENS, a dataset of 6.8K social media posts and 40K toxicity annotations across diverse identity groups. To capture the role of conversational context on toxicity, typical of social media posts, we augment MODELCITIZENS posts with LLM-generated conversational scenarios. State-of-the-art toxicity detection tools (e.g. OpenAI Moderation API, GPT-o4-mini) underperform on MODELCITIZENS, with further degradation on…
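The abstract's point about collapsing annotator perspectives into a single ground truth can be illustrated with a small sketch. Everything here is an assumption made for illustration: the toy annotations, the group names `ingroup` and `outgroup`, and the tie-breaking rule are hypothetical and are not taken from the MODELCITIZENS dataset or its annotation protocol.

```python
from collections import Counter, defaultdict

# Hypothetical toy annotations: (post_id, annotator_group, label).
# Label 1 = toxic, 0 = not toxic. Not real MODELCITIZENS data.
annotations = [
    ("post_1", "ingroup", 0),
    ("post_1", "ingroup", 0),
    ("post_1", "outgroup", 1),
    ("post_1", "outgroup", 1),
    ("post_1", "outgroup", 1),
]


def majority(labels):
    """Return the majority label; ties go to toxic (1), an arbitrary choice."""
    counts = Counter(labels)
    return 1 if counts[1] >= counts[0] else 0


def pooled_labels(rows):
    """Collapse all annotators into one label per post."""
    by_post = defaultdict(list)
    for post_id, _group, label in rows:
        by_post[post_id].append(label)
    return {post: majority(labels) for post, labels in by_post.items()}


def group_labels(rows):
    """Keep one label per (post, annotator group) pair."""
    by_key = defaultdict(list)
    for post_id, group, label in rows:
        by_key[(post_id, group)].append(label)
    return {key: majority(labels) for key, labels in by_key.items()}


pooled = pooled_labels(annotations)
per_group = group_labels(annotations)
print("pooled:", pooled)
print("per group:", per_group)
```

In this toy case the pooled vote marks the post as toxic, while the group-level view shows that the hypothetical in-group annotators did not consider it toxic. That is the kind of disagreement a single aggregated label hides.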
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Misinformation and Its Impacts
