MoMoE: Mixture of Moderation Experts Framework for AI-Assisted Online Governance
Agam Goyal, Xianyang Zhan, Yilun Chen, Koustuv Saha, Eshwar Chandrasekharan

TL;DR
MoMoE is a modular, explainable framework for online content moderation that scales across communities, matching or surpassing traditional models in accuracy while providing transparent explanations.
Contribution
Introduces MoMoE, a scalable, multi-expert framework with post-hoc explanations for cross-community content moderation, reducing the need for per-community fine-tuning.
Findings
Achieves Micro-F1 scores of 0.72 (MoMoE-Community) and 0.67 (MoMoE-NormVio) on 30 unseen subreddits with expert ensembles.
Provides reliable, concise explanations for moderation decisions.
Demonstrates steady performance of norm-violation experts across domains.
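The abstract reports results as Micro-F1 over unseen subreddits. As a reminder of what that metric pools, here is a minimal sketch. It assumes binary labels (1 = violation, 0 = acceptable) and treats "violation" as the positive class; the summary does not state the paper's exact evaluation protocol, and the function name and toy data are illustrative only.

```python
from typing import Dict, List, Tuple


def micro_f1(per_community: Dict[str, Tuple[List[int], List[int]]]) -> float:
    """Pool TP/FP/FN over all communities, then compute a single F1.

    per_community maps a community name to (y_true, y_pred).
    Assumption: label 1 is the positive ("violation") class.
    """
    tp = fp = fn = 0
    for y_true, y_pred in per_community.values():
        for truth, pred in zip(y_true, y_pred):
            if pred == 1 and truth == 1:
                tp += 1
            elif pred == 1 and truth == 0:
                fp += 1
            elif pred == 0 and truth == 1:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data (made up): two communities, six comments in total.
toy = {
    "sub_a": ([1, 0, 1, 1], [1, 0, 0, 1]),  # 2 TP, 1 FN
    "sub_b": ([0, 1], [1, 1]),              # 1 TP, 1 FP
}
score = micro_f1(toy)  # pooled: TP=3, FP=1, FN=1 -> 6 / 8 = 0.75
```

Because counts are pooled before the F1 is computed, larger communities contribute more to the score than they would under a per-community (macro) average.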
Abstract
Large language models (LLMs) have shown great potential in flagging harmful content in online communities. Yet, existing approaches for moderation require a separate model for every community and are opaque in their decision-making, limiting real-world adoption. We introduce Mixture of Moderation Experts (MoMoE), a modular, cross-community framework that adds post-hoc explanations to scalable content moderation. MoMoE orchestrates four operators -- Allocate, Predict, Aggregate, Explain -- and is instantiated as seven community-specialized experts (MoMoE-Community) and five norm-violation experts (MoMoE-NormVio). On 30 unseen subreddits, the best variants obtain Micro-F1 scores of 0.72 and 0.67, respectively, matching or surpassing strong fine-tuned baselines while consistently producing concise and reliable explanations. Although community-specialized experts deliver the highest peak…
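The four operators named in the abstract (Allocate, Predict, Aggregate, Explain) can be pictured as a small pipeline. The sketch below is not the authors' implementation: the affinity table, top-k weighting, weighted-average aggregation, keyword-based toy experts, and template explanation are all assumptions chosen to keep the example self-contained. In the paper, experts are LLM-based and the explanation is generated post hoc.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical expert interface: comment text -> P(violation) in [0, 1].
Expert = Callable[[str], float]


@dataclass
class Decision:
    remove: bool
    score: float
    weights: Dict[str, float]
    explanation: str


def allocate(community: str, affinity: Dict[str, Dict[str, float]],
             top_k: int = 2) -> Dict[str, float]:
    """Allocate: keep the top_k most related experts and normalise their weights."""
    ranked = sorted(affinity[community].items(),
                    key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(weight for _, weight in ranked)
    return {name: weight / total for name, weight in ranked}


def predict(comment: str, experts: Dict[str, Expert],
            weights: Dict[str, float]) -> Dict[str, float]:
    """Predict: query only the experts that were allocated."""
    return {name: experts[name](comment) for name in weights}


def aggregate(preds: Dict[str, float], weights: Dict[str, float]) -> float:
    """Aggregate: weighted average of the expert probabilities (an assumption)."""
    return sum(weights[name] * p for name, p in preds.items())


def explain(remove: bool, score: float, preds: Dict[str, float],
            threshold: float) -> str:
    """Explain: template stand-in for a generated post-hoc explanation."""
    lead = max(preds, key=preds.get) if remove else min(preds, key=preds.get)
    verdict = "remove" if remove else "keep"
    return (f"{verdict}: weighted score {score:.2f} vs threshold "
            f"{threshold:.2f}; most decisive expert: {lead}")


def moderate(comment: str, community: str, experts: Dict[str, Expert],
             affinity: Dict[str, Dict[str, float]],
             threshold: float = 0.5) -> Decision:
    weights = allocate(community, affinity)
    preds = predict(comment, experts, weights)
    score = aggregate(preds, weights)
    remove = score >= threshold
    return Decision(remove, score, weights,
                    explain(remove, score, preds, threshold))


# Toy experts and affinities (entirely made up for illustration).
EXPERTS: Dict[str, Expert] = {
    "politics": lambda c: 0.9 if "idiot" in c.lower() else 0.1,
    "gaming": lambda c: 0.8 if "idiot" in c.lower() else 0.2,
    "science": lambda c: 0.3,
}
AFFINITY = {"r/newsub": {"politics": 0.6, "gaming": 0.2, "science": 0.1}}

flagged = moderate("You are an idiot", "r/newsub", EXPERTS, AFFINITY)
kept = moderate("Nice post, thanks", "r/newsub", EXPERTS, AFFINITY)
```

Keeping the operators as separate functions mirrors the modularity the abstract emphasises: a new community only needs a new row in the (hypothetical) affinity table, with no expert retrained.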
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Misinformation and Its Impacts · Spam and Phishing Detection
