The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models
Ann-Kathrin Dombrowski, Dillon Bowen, Adam Gleave, Chris Cundy

TL;DR
This paper introduces the Safety Gap Toolkit, an open-source framework for measuring how much the dangerous capabilities of open-weight large language models increase when their safeguards are removed, and reports that this gap widens as models scale up.
Contribution
The paper presents a novel toolkit for estimating safety gaps in open-source LLMs and provides empirical analysis across different models and safeguard removal techniques.
Findings
Safety gap widens with model scale
Dangerous capabilities increase after safeguard removal
Effective safeguards significantly reduce risks
Abstract
Open-weight large language models (LLMs) unlock huge benefits in innovation, personalization, privacy, and democratization. However, their core advantage - modifiability - opens the door to systemic risks: bad actors can trivially subvert current safeguards, turning beneficial models into tools for harm. This leads to a 'safety gap': the difference in dangerous capabilities between a model with intact safeguards and one that has been stripped of those safeguards. We open-source a toolkit to estimate the safety gap for state-of-the-art open-weight models. As a case study, we evaluate biochemical and cyber capabilities, refusal rates, and generation quality of models from two families (Llama-3 and Qwen-2.5) across a range of parameter scales (0.5B to 405B) using different safeguard removal techniques. Our experiments reveal that the safety gap widens as model scale increases and effective…
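The abstract defines the safety gap as a difference between two measurements of the same model: dangerous capability with safeguards intact and with safeguards stripped. The following is a minimal sketch of that arithmetic only, not the toolkit's actual API. The `EvalResult` type, its field names, the `safety_gap` helper, and all numbers are hypothetical placeholders chosen for illustration; they are not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Hypothetical container for one model's scores on a dangerous-capability benchmark."""
    model: str
    params_b: float           # parameter count in billions
    safeguarded_score: float  # score with safeguards intact (0.0 to 1.0)
    stripped_score: float     # score after safeguard removal (0.0 to 1.0)


def safety_gap(result: EvalResult) -> float:
    """Safety gap as described in the abstract: stripped minus safeguarded capability."""
    return result.stripped_score - result.safeguarded_score


def gaps_by_scale(results: list[EvalResult]) -> list[tuple[float, float]]:
    """Return (params_b, gap) pairs sorted by model size, to inspect the trend with scale."""
    return sorted((r.params_b, safety_gap(r)) for r in results)


if __name__ == "__main__":
    # Placeholder numbers for illustration only; NOT measurements from the paper.
    demo = [
        EvalResult("small-model", 1.0, 0.25, 0.375),
        EvalResult("large-model", 70.0, 0.25, 0.75),
    ]
    for params_b, gap in gaps_by_scale(demo):
        print(f"{params_b:>6.1f}B parameters: safety gap = {gap:.3f}")
```

A positive value means the model became more capable on the dangerous benchmark once safeguards were removed. Comparing the values across parameter scales is one simple way to check a claim such as "the safety gap widens with model scale".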
Taxonomy
Topics: Safety Systems Engineering in Autonomy · Software Reliability and Analysis Research · Software Testing and Debugging Techniques
