XGUARD: A Graded Benchmark for Evaluating Safety Failures of Large Language Models on Extremist Content
Vadivel Abishethvarman, Bhavik Chandna, Pratik Jalan, Usman Naseem

TL;DR
XGUARD is a comprehensive benchmark that evaluates the severity of extremist content generated by large language models, providing nuanced safety assessments beyond binary safety labels.
Contribution
We introduce XGUARD, a graded safety evaluation framework with 3,840 prompts and the Attack Severity Curve for detailed analysis of LLM safety failures.
Findings
Identified safety gaps in six popular LLMs.
Showed trade-offs between robustness and expressive freedom.
Demonstrated effectiveness of lightweight defense strategies.
Abstract
Large Language Models (LLMs) can generate content spanning ideological rhetoric to explicit instructions for violence. However, existing safety evaluations often rely on simplistic binary labels (safe or unsafe), overlooking the nuanced spectrum of risk these outputs pose. To address this, we present XGUARD, a benchmark and evaluation framework designed to assess the severity of extremist content generated by LLMs. XGUARD includes 3,840 red-teaming prompts sourced from real-world data such as social media and news, covering a broad range of ideologically charged scenarios. Our framework categorizes model responses into five danger levels (0 to 4), enabling a more nuanced analysis of both the frequency and severity of failures. We introduce the interpretable Attack Severity Curve (ASC) to visualize vulnerabilities and compare defense mechanisms across threat intensities. Using XGUARD,…
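To make the contrast between binary and graded evaluation concrete, here is a minimal sketch in Python. The abstract does not give the formal definition of the ASC, so this is not the authors' method: it assumes, purely for illustration, a curve that reports for each severity threshold t the fraction of responses rated at level t or higher. The function name `severity_curve` and the sample ratings are hypothetical.

```python
def severity_curve(levels, max_level=4):
    """Fraction of responses rated at or above each severity threshold.

    Illustrative assumption only: the source abstract does not define the
    Attack Severity Curve, so this cumulative form is a stand-in.
    `levels` is a list of integer danger levels in the range 0..max_level.
    """
    if not levels:
        raise ValueError("levels must be non-empty")
    if any(lv < 0 or lv > max_level for lv in levels):
        raise ValueError("each level must lie in 0..max_level")
    n = len(levels)
    # Threshold 1 corresponds to a binary "any unsafe output" rate;
    # higher thresholds isolate progressively more severe failures.
    return {t: sum(1 for lv in levels if lv >= t) / n
            for t in range(1, max_level + 1)}


# Hypothetical ratings for eight model responses (not data from the paper).
ratings = [0, 0, 1, 2, 4, 0, 3, 1]
curve = severity_curve(ratings)
binary_failure_rate = curve[1]  # what a safe/unsafe label alone would report
print(curve)
```

Under this assumption, a binary evaluation would report only `curve[1]`, while the remaining thresholds show how many of those failures reach the more dangerous levels.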
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification
