PL-Guard: Benchmarking Language Model Safety for Polish

Aleksandra Krasnodębska; Karolina Seweryn; Szymon Łukasik; Wojciech Kusa

arXiv:2506.16322·cs.CL·June 23, 2025

TL;DR

This paper introduces a Polish safety benchmark dataset for language models, evaluates various models' robustness and performance, and finds that a HerBERT-based classifier performs best, especially against adversarial samples.

Contribution

It provides the first manually annotated safety benchmark for Polish LLMs and evaluates model robustness using adversarially perturbed samples.

Findings

01. HerBERT-based classifier outperforms other models in safety classification
02. Models show decreased performance under adversarial perturbations
03. Fine-tuning on annotated data improves model safety detection

Abstract

Despite increasing efforts to ensure the safety of large language models (LLMs), most existing safety assessments and moderation tools remain heavily biased toward English and other high-resource languages, leaving the majority of global languages underexamined. To address this gap, we introduce a manually annotated benchmark dataset for language model safety classification in Polish. We also create adversarially perturbed variants of these samples designed to challenge model robustness. We conduct a series of experiments to evaluate LLM-based and classifier-based models of varying sizes and architectures. Specifically, we fine-tune three models: Llama-Guard-3-8B, a HerBERT-based classifier (a Polish BERT derivative), and PLLuM, a Polish-adapted Llama-8B model. We train these models using different combinations of annotated data and evaluate their performance, comparing it against publicly…
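
The abstract mentions adversarially perturbed variants without saying how they are built. As a rough illustration only, the sketch below shows two generic character-level perturbations that are commonly applied to Polish text (diacritic stripping and adjacent-letter swaps) and a helper that compares classifier accuracy on clean and perturbed inputs. The function names, the perturbation choices, and the toy keyword classifier are all assumptions made for this example and are not taken from the paper.

```python
import random
import unicodedata


def strip_diacritics(text: str) -> str:
    """Hypothetical perturbation: replace Polish diacritics with base letters."""
    # 'ł'/'Ł' have no Unicode decomposition, so they are mapped explicitly.
    text = text.replace("ł", "l").replace("Ł", "L")
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks left behind by the decomposition.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def swap_adjacent(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Hypothetical perturbation: swap neighbouring letters with probability `rate`."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)


def accuracy(classify, texts, labels, perturb=None):
    """Fraction of correct predictions, optionally after perturbing each text."""
    if perturb is not None:
        texts = [perturb(t) for t in texts]
    correct = sum(classify(t) == y for t, y in zip(texts, labels))
    return correct / len(labels)


# Toy stand-in for a safety classifier (assumption: a simple keyword match).
def toy_classifier(text: str) -> str:
    return "unsafe" if "groźba" in text else "safe"


texts = ["to jest groźba", "miłego dnia"]
labels = ["unsafe", "safe"]
clean_acc = accuracy(toy_classifier, texts, labels)
perturbed_acc = accuracy(toy_classifier, texts, labels, perturb=strip_diacritics)
```

In this toy setup the keyword classifier misses the threat once the diacritics are removed, which is the kind of robustness gap a perturbed test set is meant to expose. A real evaluation would replace `toy_classifier` with an actual model and use the benchmark's own perturbed samples.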

Code & Models

Models

NASK-PIB/HerBERT-PL-Guard (Hugging Face model, 70 downloads, 2 likes)

Datasets

NASK-PIB/PL-Guard (dataset, 4 downloads)

Videos

PL-Guard: Benchmarking Language Model Safety for Polish· underline

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling