PL-Guard: Benchmarking Language Model Safety for Polish
Aleksandra Krasnodębska, Karolina Seweryn, Szymon Łukasik, Wojciech Kusa

TL;DR
This paper introduces a Polish safety benchmark dataset for language models, evaluates various models' robustness and performance, and finds that a HerBERT-based classifier performs best, especially against adversarial samples.
Contribution
It provides the first manually annotated safety benchmark for Polish LLMs and evaluates model robustness using adversarially perturbed samples.
Findings
HerBERT-based classifier outperforms other models in safety classification
Models show decreased performance under adversarial perturbations
Fine-tuning on annotated data improves model safety detection
Abstract
Despite increasing efforts to ensure the safety of large language models (LLMs), most existing safety assessments and moderation tools remain heavily biased toward English and other high-resource languages, leaving the majority of global languages underexamined. To address this gap, we introduce a manually annotated benchmark dataset for language model safety classification in Polish. We also create adversarially perturbed variants of these samples designed to challenge model robustness. We conduct a series of experiments to evaluate LLM-based and classifier-based models of varying sizes and architectures. Specifically, we fine-tune three models: Llama-Guard-3-8B, a HerBERT-based classifier (a Polish BERT derivative), and PLLuM, a Polish-adapted Llama-8B model. We train these models using different combinations of annotated data and evaluate their performance, comparing it against publicly…
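As a rough illustration of what evaluating on "adversarially perturbed variants" can look like, the sketch below applies two simple text perturbations and measures how much a classifier's accuracy drops. The perturbations (diacritic stripping and an adjacent-character swap), the helper names, and the keyword-based `toy_classifier` are all assumptions made for this example; the abstract does not say which perturbations or metrics the authors used.

```python
import random

# Polish diacritics mapped to ASCII look-alikes. This is an illustrative
# perturbation, not necessarily one used in the paper.
_DIACRITICS = str.maketrans("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ", "acelnoszzACELNOSZZ")


def strip_diacritics(text: str) -> str:
    """Replace Polish diacritic letters with their ASCII counterparts."""
    return text.translate(_DIACRITICS)


def swap_adjacent(text: str, rng: random.Random) -> str:
    """Swap one randomly chosen pair of adjacent characters (a simple typo)."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def accuracy(classify, samples):
    """Fraction of (text, label) pairs that `classify` labels correctly."""
    correct = sum(classify(text) == label for text, label in samples)
    return correct / len(samples)


def robustness_gap(classify, samples, perturb):
    """Return (clean accuracy, perturbed accuracy, drop between them)."""
    clean = accuracy(classify, samples)
    perturbed = accuracy(classify, [(perturb(t), y) for t, y in samples])
    return clean, perturbed, clean - perturbed


def toy_classifier(text: str) -> str:
    """Hypothetical stand-in for a real safety model: a single keyword match."""
    return "unsafe" if "zniszczę" in text.lower() else "safe"


# Hypothetical two-example dataset, only for demonstrating the helpers.
samples = [("Zniszczę cię", "unsafe"), ("Miłego dnia", "safe")]
clean_acc, perturbed_acc, gap = robustness_gap(
    toy_classifier, samples, strip_diacritics
)
```

The keyword matcher is deliberately brittle, so removing diacritics is enough to make it miss the flagged example. A fine-tuned model would replace `toy_classifier`, and a real evaluation would use the benchmark's own perturbed samples rather than these helpers.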
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling
