ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts
Trapoom Ukarapol, Nut Chukamphaeng, Kunat Pipatanakul, Pakhapoom Sarapat

TL;DR
This paper introduces ThaiSafetyBench, a benchmark for evaluating the safety of language models in Thai. It finds that open-source models are vulnerable to culturally specific attacks and proposes a fine-tuned classifier for safety assessment.
Contribution
The work presents ThaiSafetyBench, a culturally grounded safety benchmark for LLMs operating in Thai, and develops ThaiSafetyClassifier, a fine-tuned model whose safety judgments match those of GPT-4.1.
Findings
Closed-source models are generally safer than open-source models.
Culturally grounded attacks have higher success rates than general harmful prompts, exposing safety vulnerabilities.
The ThaiSafetyClassifier achieves an 84.4% F1 score, matching GPT-4.1 judgments.
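For readers unfamiliar with the metric, the following is a minimal sketch of how an F1 score can be computed when one labeler's verdicts are treated as the reference and a classifier's verdicts as predictions. It assumes a binary harmful/safe labeling and plain (unweighted) F1 on the "harmful" class; the averaging scheme behind the 84.4% figure is not stated here, and the function name, label strings, and toy data are all hypothetical.

```python
def f1_score(reference, predicted, positive="harmful"):
    """Binary F1 on the `positive` class, treating `reference` as ground truth."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == positive and p == positive for r, p in pairs)  # true positives
    fp = sum(r != positive and p == positive for r, p in pairs)  # false positives
    fn = sum(r == positive and p != positive for r, p in pairs)  # false negatives
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correctly flagged
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only; these are not results from the paper.
judge_labels = ["harmful", "harmful", "harmful", "safe", "safe"]
classifier_labels = ["harmful", "harmful", "safe", "harmful", "safe"]
score = f1_score(judge_labels, classifier_labels)
```

In this toy case the classifier agrees with the reference on two of three harmful responses and raises one false alarm, so precision and recall are both 2/3.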
Abstract
The safety evaluation of large language models (LLMs) remains largely centered on English, leaving non-English languages and culturally grounded risks underexplored. In this work, we investigate LLM safety in the context of the Thai language and culture and introduce ThaiSafetyBench, an open-source benchmark comprising 1,954 malicious prompts written in Thai. The dataset covers both general harmful prompts and attacks that are explicitly grounded in Thai cultural, social, and contextual nuances. Using ThaiSafetyBench, we evaluate 24 LLMs, with GPT-4.1 and Gemini-2.5-Pro serving as LLM-as-a-judge evaluators. Our results show that closed-source models generally demonstrate stronger safety performance than open-source counterparts, raising important concerns regarding the robustness of openly available models. Moreover, we observe a consistently higher Attack Success Rate (ASR) for…
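As a rough illustration of the Attack Success Rate metric mentioned above, the sketch below computes ASR per prompt group as the fraction of responses a judge marks as harmful. It assumes one boolean verdict per response; how the paper combines verdicts from its two judges is not specified here, and the function name, group names, and toy verdicts are hypothetical.

```python
from collections import defaultdict


def attack_success_rate(verdicts):
    """Return {group: ASR} from an iterable of (group, is_harmful) pairs.

    ASR for a group = harmful responses / total prompts in that group.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for group, is_harmful in verdicts:
        totals[group] += 1
        if is_harmful:
            successes[group] += 1
    return {group: successes[group] / totals[group] for group in totals}


# Toy verdicts for illustration only; these are not results from the paper.
verdicts = [
    ("general", False),
    ("general", False),
    ("general", True),
    ("general", False),
    ("thai_cultural", True),
    ("thai_cultural", False),
]
rates = attack_success_rate(verdicts)
```

Grouping by prompt category in this way makes it straightforward to compare ASR across categories for a single model, or across models for a single category.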
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection
