Benchmarking LLM Guardrails in Handling Multilingual Toxicity

Yahan Yang; Soham Dan; Dan Roth; Insup Lee

arXiv:2410.22153·cs.CL·October 30, 2024

TL;DR

This paper evaluates the effectiveness of current guardrails in Large Language Models for detecting and preventing multilingual toxicity, revealing significant limitations and robustness issues across diverse languages and attack methods.

Contribution

Introduces a comprehensive multilingual benchmark for LLM guardrails and assesses their robustness against jailbreaks and language resource variability.

Findings

1. Guardrails are ineffective in multilingual toxicity detection.
2. Guardrails lack robustness against jailbreak prompts.
3. Performance varies with language resource availability.

Abstract

With the ubiquity of Large Language Models (LLMs), guardrails have become crucial to detect and defend against toxic content. However, with the increasing pervasiveness of LLMs in multilingual scenarios, their effectiveness in handling multilingual toxic inputs remains unclear. In this work, we introduce a comprehensive multilingual test suite, spanning seven datasets and over ten languages, to benchmark the performance of state-of-the-art guardrails. We also investigate the resilience of guardrails against recent jailbreaking techniques, and assess the impact of in-context safety policies and language resource availability on guardrails' performance. Our findings show that existing guardrails are still ineffective at handling multilingual toxicity and lack robustness against jailbreaking prompts. This work aims to identify the limitations of guardrails and to build a more reliable and…
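To make the idea of benchmarking a guardrail per language concrete, here is a minimal sketch in Python. It is not the paper's evaluation code: the function name `per_language_f1`, the record layout, the choice of F1 as the metric, and the toy keyword guardrail are all assumptions made for illustration.

```python
from collections import defaultdict


def per_language_f1(records, guardrail):
    """Compute F1 on the toxic class separately for each language.

    records   -- iterable of (language, text, is_toxic) tuples (assumed layout)
    guardrail -- callable mapping a text to True if it flags the text as toxic
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for language, text, is_toxic in records:
        flagged = guardrail(text)
        c = counts[language]  # ensures every language gets an entry
        if flagged and is_toxic:
            c["tp"] += 1
        elif flagged and not is_toxic:
            c["fp"] += 1
        elif not flagged and is_toxic:
            c["fn"] += 1

    scores = {}
    for language, c in counts.items():
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when nothing is toxic or flagged
        scores[language] = 2 * c["tp"] / denom if denom else 0.0
    return scores


# Hypothetical toy guardrail: an English-only keyword filter.
def toy_guardrail(text):
    return "hate" in text.lower()


# Hypothetical toy data, not drawn from any of the paper's datasets.
toy_records = [
    ("en", "I hate you", True),
    ("en", "Nice day", False),
    ("de", "Ich hasse dich", True),
    ("de", "Schöner Tag", False),
]

scores = per_language_f1(toy_records, toy_guardrail)
print(scores)
```

In this toy setup the English-only filter catches the English toxic example but misses the German one, so the per-language scores diverge. Splitting results by language is what exposes that kind of gap; an aggregate score over all records would hide it.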

Taxonomy

Topics: Natural Language Processing Techniques