Are Open-Weight LLMs Ready for Social Media Moderation? A Comparative Study on Bluesky
Hsuan-Yu Chou, Wajiha Naveed, Shuyan Zhou, and Xiaowei Yang

TL;DR
This study compares open-weight and proprietary large language models (LLMs) for social media moderation on Bluesky. It finds that open-weight models perform comparably and can support privacy-preserving moderation on consumer hardware.
Contribution
It provides a comprehensive evaluation of open-weight LLMs for social media moderation, highlighting their potential to match proprietary models in real-world settings.
Findings
Open-weight LLMs achieve 81%–97% sensitivity and 91%–100% specificity.
Open-weight LLMs perform comparably to proprietary models.
Specificity is higher for rudeness detection; sensitivity is higher for intolerance and threats.
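Sensitivity and specificity are the standard rates of a binary confusion matrix, computed against reference labels (here, Bluesky Moderation Service decisions or author annotations). Sensitivity is the share of harmful posts a model flags. Specificity is the share of benign posts it leaves unflagged. The Python sketch below computes both per harm category. It assumes each post is a `(category, reference_label, model_label)` tuple with boolean labels. The record layout and the names `Confusion` and `per_category_metrics` are hypothetical illustrations, not the authors' evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts for one harm category (hypothetical helper)."""
    tp: int = 0  # harmful post, model flagged it
    fn: int = 0  # harmful post, model missed it
    tn: int = 0  # benign post, model left it alone
    fp: int = 0  # benign post, model flagged it anyway

    @property
    def sensitivity(self) -> float:
        # True positive rate: TP / (TP + FN); NaN if there are no harmful posts.
        total = self.tp + self.fn
        return self.tp / total if total else float("nan")

    @property
    def specificity(self) -> float:
        # True negative rate: TN / (TN + FP); NaN if there are no benign posts.
        total = self.tn + self.fp
        return self.tn / total if total else float("nan")


def per_category_metrics(records):
    """Tally (category, reference, predicted) records into per-category counts."""
    table: dict[str, Confusion] = {}
    for category, reference, predicted in records:
        counts = table.setdefault(category, Confusion())
        if reference and predicted:
            counts.tp += 1
        elif reference:
            counts.fn += 1
        elif predicted:
            counts.fp += 1
        else:
            counts.tn += 1
    return table


# Toy, made-up records for illustration only (not data from the study).
records = [
    ("rudeness", True, True),
    ("rudeness", True, False),
    ("rudeness", False, False),
    ("rudeness", False, False),
    ("threat", True, True),
    ("threat", False, True),
]
metrics = per_category_metrics(records)
for name, counts in metrics.items():
    print(name, counts.sensitivity, counts.specificity)
```

Computing the two rates per category, rather than pooled, makes it possible to observe the pattern above: a model can favor specificity on one category (e.g. rudeness) and sensitivity on another (e.g. intolerance or threats).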
Abstract
As internet access expands, so does exposure to harmful content, increasing the need for effective moderation. Research has shown that large language models (LLMs) can be used effectively for social media moderation tasks, including harmful content detection. While proprietary LLMs have been shown to outperform traditional machine learning models in zero-shot settings, the out-of-the-box capability of open-weight LLMs remains an open question. Motivated by recent developments in reasoning LLMs, we evaluate seven state-of-the-art models: four proprietary and three open-weight. Testing with real-world posts on Bluesky, moderation decisions by the Bluesky Moderation Service, and annotations by two authors, we find considerable overlap between the sensitivity (81%–97%) and specificity (91%–100%) of the open-weight LLMs and those of the proprietary ones (72%–98% and 93%–99%, respectively).…
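The "considerable overlap" claim can be checked directly from the ranges stated in the abstract by intersecting each pair of closed percentage intervals. The Python sketch below does this; `overlap` is a hypothetical helper, and the only inputs are the reported range endpoints.

```python
def overlap(a, b):
    """Intersection of two closed ranges (lo, hi), or None if they are disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


# Ranges reported in the abstract, in percent.
open_weight = {"sensitivity": (81, 97), "specificity": (91, 100)}
proprietary = {"sensitivity": (72, 98), "specificity": (93, 99)}

for metric in open_weight:
    print(metric, overlap(open_weight[metric], proprietary[metric]))
```

The open-weight sensitivity range lies entirely within the proprietary one. Conversely, the proprietary specificity range lies entirely within the open-weight one, so the intersections are (81, 97) and (93, 99).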
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Spam and Phishing Detection · Misinformation and Its Impacts
