HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment
Yannis Belkhiter, Giulio Zizzo, Sergio Maffeis

TL;DR
This paper introduces a new harm-level dataset and benchmark for evaluating LLM vulnerabilities, analyzes the impact of quantization on model robustness, and provides insights into safeguarding against harmful outputs.
Contribution
It presents a novel harm-level assessment dataset, benchmarks jailbreaking attacks on Vicuna 13B, and investigates how quantization affects model alignment and robustness.
Findings
Quantization techniques influence model robustness and vulnerability.
Harmful input queries increase jailbreaking complexity.
Trade-offs exist between robustness and vulnerability with quantization.
Abstract
With the introduction of the transformer architecture, LLMs have revolutionized the NLP field with ever more powerful models. Nevertheless, their development has come with several challenges. The exponential growth in computational power and reasoning capabilities of language models has heightened concerns about their security. As models become more powerful, ensuring their safety has become a crucial focus in research. This paper aims to address gaps in the current literature on jailbreaking techniques and the evaluation of LLM vulnerabilities. Our contributions include the creation of a novel dataset designed to assess the harmfulness of model outputs across multiple harm levels, as well as a focus on fine-grained harm-level analysis. Using this framework, we provide a comprehensive benchmark of state-of-the-art jailbreaking attacks, specifically targeting the Vicuna 13B v1.5 model.…
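To make the idea of fine-grained harm-level analysis concrete, the following is a minimal sketch of how attack outcomes might be aggregated per harm level. It is an illustration under stated assumptions, not the paper's evaluation code: the record layout, the names `harm_level` and `jailbroken`, the function `success_rate_by_harm_level`, and the toy data are all hypothetical, and the paper's actual metric and judging procedure are not described in the text above.

```python
from collections import defaultdict


def success_rate_by_harm_level(records):
    """Return the fraction of successful attacks for each harm level.

    `records` is an iterable of (harm_level, jailbroken) pairs, where
    `harm_level` is any sortable label and `jailbroken` is a bool.
    Both field names are hypothetical, chosen for this sketch only.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for harm_level, jailbroken in records:
        totals[harm_level] += 1
        if jailbroken:
            successes[harm_level] += 1
    # Sort the levels so the result can be read from lowest to highest.
    return {level: successes[level] / totals[level] for level in sorted(totals)}


# Made-up placeholder records; these are NOT results from the paper.
toy_records = [
    (1, True), (1, True), (1, False),
    (2, True), (2, False),
    (3, False), (3, False),
]

rates = success_rate_by_harm_level(toy_records)
for level, rate in rates.items():
    print(f"harm level {level}: success rate {rate:.2f}")
```

Grouping by harm level rather than reporting a single aggregate rate is what would let a comparison, for example between a full-precision and a quantized model, show whether any difference is concentrated at particular levels.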
Taxonomy
Topics: HIV, Drug Use, Sexual Risk
Methods: Focus
