Can Language Model Moderators Improve the Health of Online Discourse?
Hyundong Cho, Shuai Liu, Taiwei Shi, Darpan Jain, Basem Rizk, Yuyang, Huang, Zixun Lu, Nuan Wen, Jonathan Gratch, Emilio Ferrara, Jonathan May

TL;DR
This paper explores the potential of language models to assist online community moderation, proposing a new evaluation framework and finding that models can give fair feedback but struggle to promote respectful behavior.
Contribution
It introduces a systematic evaluation framework for language models as moderators and provides the first study assessing their effectiveness in real moderation scenarios.
Findings
Models can give fair feedback on toxicity.
Models struggle to increase user respect and cooperation.
Evaluation framework enables realistic assessment of moderation capabilities.
Abstract
Conversational moderation of online communities is crucial to maintaining civility for a constructive environment, but it is challenging to scale and harmful to moderators. The inclusion of sophisticated natural language generation modules as a force multiplier to aid human moderators is a tantalizing prospect, but adequate evaluation approaches have so far been elusive. In this paper, we establish a systematic definition of conversational moderation effectiveness grounded on moderation literature and establish design criteria for conducting realistic yet safe evaluation. We then propose a comprehensive evaluation framework to assess models' moderation capabilities independently of human intervention. With our framework, we conduct the first known study of language models as conversational moderators, finding that appropriately prompted models that incorporate insights from social…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsHate Speech and Cyberbullying Detection
MethodsFocus
