LionGuard: Building a Contextualized Moderation Classifier to Tackle Localized Unsafe Content
Jessica Foo, Shaun Khoo

TL;DR
LionGuard is a Singapore-specific moderation classifier that improves detection of unsafe content in local language varieties such as Singlish. It outperforms generic moderation APIs and underscores the importance of localization in moderation tools.
Contribution
We introduce LionGuard, a localized moderation classifier tailored to the Singaporean context, and demonstrate significant performance gains over existing non-localized moderation APIs.
Findings
LionGuard outperforms existing moderation APIs on Singlish data by 14% (binary) and up to 51% (multi-label).
Localization enhances moderation accuracy for low-resource languages.
The approach is practical and scalable, and is presented as applicable to other low-resource languages.
Abstract
As large language models (LLMs) become increasingly prevalent in a wide variety of applications, concerns about the safety of their outputs have become more significant. Most efforts at safety-tuning or moderation today take on a predominantly Western-centric view of safety, especially for toxic, hateful, or violent speech. In this paper, we describe LionGuard, a Singapore-contextualized moderation classifier that can serve as guardrails against unsafe LLM outputs. When assessed on Singlish data, LionGuard outperforms existing widely-used moderation APIs, which are not finetuned for the Singapore context, by 14% (binary) and up to 51% (multi-label). Our work highlights the benefits of localization for moderation classifiers and presents a practical and scalable approach for low-resource languages.
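To make the guardrail idea and the binary versus multi-label distinction concrete, here is a minimal sketch of how a moderation classifier's scores might gate an LLM response. Everything named here is an assumption for illustration: the category labels (borrowed loosely from the abstract's "toxic, hateful, or violent"), the 0.5 thresholds, the `moderate` and `guarded_reply` helpers, and the keyword-based `toy_score_fn`, which stands in for a real classifier. None of this is LionGuard's actual interface or taxonomy.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical category names and thresholds, for illustration only.
THRESHOLDS: Dict[str, float] = {"hateful": 0.5, "toxic": 0.5, "violent": 0.5}


@dataclass
class ModerationResult:
    flagged: bool                 # binary decision: is the text unsafe at all?
    categories: Dict[str, bool]   # multi-label decision: which categories fired?


def moderate(scores: Dict[str, float],
             thresholds: Dict[str, float] = THRESHOLDS) -> ModerationResult:
    """Turn per-category scores into multi-label and binary decisions."""
    # A category fires when its score meets or exceeds its threshold;
    # categories missing from `scores` are treated as score 0.0.
    categories = {name: scores.get(name, 0.0) >= t
                  for name, t in thresholds.items()}
    # The binary label is simply "any category fired".
    return ModerationResult(flagged=any(categories.values()),
                            categories=categories)


def guarded_reply(llm_output: str,
                  score_fn: Callable[[str], Dict[str, float]],
                  fallback: str = "[response withheld]") -> str:
    """Return the LLM output unless the classifier flags it."""
    result = moderate(score_fn(llm_output))
    return fallback if result.flagged else llm_output


def toy_score_fn(text: str) -> Dict[str, float]:
    """Stand-in scorer: a keyword check, NOT a real moderation model."""
    hit = "badword" in text.lower()
    return {"hateful": 0.9 if hit else 0.1, "toxic": 0.2, "violent": 0.0}
```

In this sketch, `guarded_reply("hello", toy_score_fn)` passes the text through unchanged, while any text containing the placeholder keyword is replaced by the fallback string. A localized classifier would take the place of `toy_score_fn`; the gating logic around it stays the same.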
Taxonomy
Topics: Hate Speech and Cyberbullying Detection
