SEA-SafeguardBench: Evaluating AI Safety in SEA Languages and Cultures
Panuthep Tasawong, Jian Gang Ngui, Alham Fikri Aji, Trevor Cohn, Peerat Limkonchotiwat

TL;DR
SEA-SafeguardBench is a comprehensive, human-verified safety benchmark for Southeast Asian languages, revealing that current LLM safety measures underperform on culturally nuanced and region-specific harmful content.
Contribution
The paper introduces SEA-SafeguardBench, the first native, human-verified safety benchmark for SEA languages, addressing linguistic and cultural gaps in existing safety evaluations.
Findings
State-of-the-art LLMs struggle with SEA cultural harm scenarios.
Current guardrails underperform on SEA language content.
English-centric benchmarks do not capture regional safety nuances.
Abstract
Safeguard models help large language models (LLMs) detect and block harmful content, but most evaluations remain English-centric and overlook linguistic and cultural diversity. Existing multilingual safety benchmarks often rely on machine-translated English data, which fails to capture nuances in low-resource languages. Southeast Asian (SEA) languages are underrepresented despite the region's linguistic diversity and unique safety concerns, from culturally sensitive political speech to region-specific misinformation. Addressing these gaps requires benchmarks that are natively authored to reflect local norms and harm scenarios. We introduce SEA-SafeguardBench, the first human-verified safety benchmark for SEA, covering eight languages, 21,640 samples, across three subsets: general, in-the-wild, and content generation. The experimental results from our benchmark demonstrate that even…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- fills an important gap in LLM safety evaluation beyond English
- well constructed benchmark with high-quality human annotations
- focus on the difference between generic (Western-centric) and culturally sensitive evaluation
- meaningful annotation setup where safety vs sensitive content is measured through annotator agreement
- inclusion of input and output safety
- The analysis conducted beyond the benchmark contribution reveals some interesting insights into the differences between sensit
# Major Weaknesses
**Word Overlap Analysis**
- the usage of tokens instead of lemmata here makes little sense, since tokens are subwords and are case- and whitespace-sensitive
- additionally there are duplicate tokens for conjugations/declensions with significant semantic overlap
- a more standard NLP analysis that eliminates stop-words and considers lemmata would be more useful
- in addition the existence of new words alone is a decent indicator of a significant change but a more detaile
1. The benchmark is the first to focus on safety evaluation in Southeast‑Asian languages, filling a void in existing multilingual benchmarks that rely on machine‑translated English data.
2. The dataset includes 21,640 samples across three carefully defined subsets (General, In-the-Wild, and Content Generation), each serving a different safety dimension, with strong annotation protocols (multiple annotators, majority voting, sensitive-label handling).
3. The experiments on 20 safeguard models rev
1. Although human-verified, both the General and CG cultural subsets originate from Google NMT translations, which may imprint English framing and safety priors and weaken the paper's "native authoring" claim.
2. The paper reveals gaps but stops short of giving clear design principles for culturally-aligned safeguard modeling (beyond needing more data).
3. No systematic taxonomy of “cultural topics” is presented. The notion of “culture” as used in the benchmark stays conceptually vague.
4. The
1. The authors assemble culturally grounded SEA data across 7 SEA languages + English, with native SEA speakers writing/validating content and details on annotator hiring, pay, and QA, which is important for ethics and authenticity. This fills a gap left by the English-centric or translation-only datasets of prior work. The authors also show that current models cannot cover SEA languages effectively, demonstrating performance degradation in both guard models and language models.
2. The paper
1. Sensitive samples are labeled "safe" in prompts but "unsafe" in responses, which seems like an arbitrary choice, especially given that the sensitive samples represent ambiguous cases where no clear majority has been reached. The experimental setting would have been more persuasive had you excluded the sensitive samples or treated them as a separate class.
2. There is a severe class imbalance in the Content-Generation subset (see A.6). Generating harmful responses with jailbreaking
1. Shifts safety evaluation toward under-served SEA languages/cultures, beyond English.
2. Blends translated general safety data with native ITW and CG subsets to probe culture-specific failures.
3. Compares various guardrails/LLMs with clear metrics (AUPRC, threshold sensitivity) and qualitative failure analyses.
4. Results highlight major cross-lingual and cultural performance gaps, reinforcing the need for region-aware safety alignment.
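To make the metrics named in the review above concrete, here is a minimal Python sketch of AUPRC (computed as step-wise average precision) and of F1 at a fixed decision threshold. It is an illustration only, not the paper's evaluation code: the function names `average_precision` and `f1_at_threshold`, the toy labels and scores, and the use of F1 across several thresholds as a proxy for threshold sensitivity are all assumptions.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise AUPRC for binary labels (1 = unsafe, 0 = safe).

    Hypothetical helper: sums precision at each distinct score threshold,
    weighted by the recall gained at that threshold.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive (unsafe) example")
    # Rank examples from most to least likely unsafe.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    ap = 0.0
    tp = 0
    i = 0
    while i < len(ranked):
        # Consume every example sharing this score so ties form one threshold.
        j = i
        tp_here = 0
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp_here += ranked[j][1]
            j += 1
        tp += tp_here
        precision = tp / j  # j examples are predicted unsafe at this threshold
        ap += (tp_here / total_pos) * precision  # recall gain times precision
        i = j
    return ap


def f1_at_threshold(labels: Sequence[int], scores: Sequence[float],
                    threshold: float) -> float:
    """F1 of the 'unsafe' class when predicting unsafe iff score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Toy data (assumed, not from the benchmark): 1 = unsafe, 0 = safe.
    labels = [1, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.1]
    print("AUPRC:", average_precision(labels, scores))
    for t in (0.3, 0.5, 0.85):
        print(f"F1 at threshold {t}:", f1_at_threshold(labels, scores, t))
```

AUPRC summarizes ranking quality without committing to a cutoff, while sweeping `threshold` shows how much a deployed guardrail's F1 moves when the cutoff is changed, which is one plausible reading of "threshold sensitivity".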
1. Limited novelty beyond prior work (SafeWorld).
   - The benchmark design, particularly the ITW and CG subsets, closely mirrors SafeWorld: Geo-Diverse Safety Alignment (NeurIPS 2024), which already proposed:
     - human-verified prompts grounded in local cultural/legal norms,
     - evaluation of cultural safety across multiple regions, and
     - analyses of culturally conditioned “unsafe” behaviors.
   - SEA-SafeguardBench mainly adapts this paradigm to SEA locales. The paper should explicitly position itse
Taxonomy
Topics: Computational and Text Analysis Methods · Hate Speech and Cyberbullying Detection · Topic Modeling
