The Problem with Safety Classification is not just the Models
Sowmya Vajjala

TL;DR
This paper highlights the limitations of current multilingual safety classifiers for LLMs. It reveals disparities across languages and issues with the evaluation datasets, and argues for improved safety assessment methods.
Contribution
It uncovers multilingual disparities and dataset shortcomings in safety classifiers, challenging the assumption that model flaws are the sole cause of safety issues.
Findings
Multilingual disparities exist in 5 safety classification models, evaluated on datasets covering 18 languages.
The evaluation datasets have potential issues that affect how safety classifiers are assessed.
The shortcomings of current safety classifiers are not due to the models alone; issues in the evaluation datasets also play a role.
Abstract
Studying the robustness of Large Language Models (LLMs) to unsafe behaviors is an important topic of research today. Building safety classification models or guard models, which are fine-tuned models for input/output safety classification for LLMs, is seen as one of the solutions to address the issue. Although there is a lot of research on the safety testing of LLMs themselves, there is little research on evaluating the effectiveness of such safety classifiers or the evaluation datasets used for testing them, especially in multilingual scenarios. In this position paper, we demonstrate how multilingual disparities exist in 5 safety classification models by considering datasets covering 18 languages. At the same time, we identify potential issues with the evaluation datasets, arguing that the shortcomings of current safety classifiers are not only because of the models themselves. We…
