The Problem with Safety Classification is not just the Models

Sowmya Vajjala

arXiv:2507.21782·cs.CL·July 30, 2025

TL;DR

This paper highlights the limitations of current multilingual safety classifiers for LLMs. It shows disparities across languages, identifies issues with the evaluation datasets, and argues that safety assessment methods need to improve.

Contribution

It uncovers multilingual disparities and dataset shortcomings in safety classifiers, challenging the assumption that model flaws are the sole cause of safety issues.

Findings

01

Multilingual disparities exist in 5 safety classification models, evaluated on datasets covering 18 languages.

02

Evaluation datasets have significant shortcomings affecting safety classifier assessments.

03

The shortcomings of current safety classifiers are not due to the models alone; issues with the evaluation datasets also play a role.

Abstract

Studying the robustness of Large Language Models (LLMs) to unsafe behaviors is an important topic of research today. Building safety classification models or guard models, which are fine-tuned models for input/output safety classification for LLMs, is seen as one of the solutions to address the issue. Although there is a lot of research on the safety testing of LLMs themselves, there is little research on evaluating the effectiveness of such safety classifiers or the evaluation datasets used for testing them, especially in multilingual scenarios. In this position paper, we demonstrate how multilingual disparities exist in 5 safety classification models by considering datasets covering 18 languages. At the same time, we identify potential issues with the evaluation datasets, arguing that the shortcomings of current safety classifiers are not only because of the models themselves. We…
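
As a rough illustration of what measuring a multilingual disparity can look like, the sketch below scores one classifier's predictions separately per language and reports the gap between the best and worst language. The metric (F1 on the unsafe class), the label names, the function names, and the toy records are all assumptions made for this example; the excerpt above does not say which metrics or data format the paper uses.

```python
from collections import defaultdict


def f1_unsafe(pairs):
    """F1 for the 'unsafe' class from (gold, predicted) label pairs."""
    tp = sum(1 for g, p in pairs if g == "unsafe" and p == "unsafe")
    fp = sum(1 for g, p in pairs if g == "safe" and p == "unsafe")
    fn = sum(1 for g, p in pairs if g == "unsafe" and p == "safe")
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def per_language_f1(records):
    """Group (language, gold, predicted) records by language and score each group."""
    by_lang = defaultdict(list)
    for lang, gold, pred in records:
        by_lang[lang].append((gold, pred))
    return {lang: f1_unsafe(pairs) for lang, pairs in by_lang.items()}


def disparity_gap(scores):
    """Difference between the best and worst per-language score."""
    return max(scores.values()) - min(scores.values())


# Hypothetical toy data, not taken from the paper.
records = [
    ("en", "unsafe", "unsafe"),
    ("en", "unsafe", "unsafe"),
    ("en", "safe", "safe"),
    ("en", "safe", "safe"),
    ("hi", "unsafe", "unsafe"),
    ("hi", "unsafe", "safe"),
    ("hi", "safe", "safe"),
    ("hi", "safe", "safe"),
]

scores = per_language_f1(records)
gap = disparity_gap(scores)
print(scores, gap)
```

A gap computed this way mixes model behavior with dataset quality: if the labels or translations for one language are unreliable, the score for that language drops even when the classifier is behaving sensibly, which is the kind of confound the abstract points to.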
