The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models

Ann-Kathrin Dombrowski; Dillon Bowen; Adam Gleave; Chris Cundy

arXiv:2507.11544 · cs.CY · July 17, 2025

Open Access

TL;DR

This paper introduces the Safety Gap Toolkit, an open-source framework for measuring how much a large language model's dangerous capabilities increase when its safeguards are removed, and reports that this gap widens as models scale up.

Contribution

The paper presents a novel toolkit for estimating safety gaps in open-source LLMs and provides empirical analysis across different models and safeguard removal techniques.

Findings

01. Safety gap widens with model scale

02. Dangerous capabilities increase after safeguard removal

03. Effective safeguards significantly reduce risks

Abstract

Open-weight large language models (LLMs) unlock huge benefits in innovation, personalization, privacy, and democratization. However, their core advantage - modifiability - opens the door to systemic risks: bad actors can trivially subvert current safeguards, turning beneficial models into tools for harm. This leads to a 'safety gap': the difference in dangerous capabilities between a model with intact safeguards and one that has been stripped of those safeguards. We open-source a toolkit to estimate the safety gap for state-of-the-art open-weight models. As a case study, we evaluate biochemical and cyber capabilities, refusal rates, and generation quality of models from two families (Llama-3 and Qwen-2.5) across a range of parameter scales (0.5B to 405B) using different safeguard removal techniques. Our experiments reveal that the safety gap widens as model scale increases and effective…
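The definition of the safety gap above can be illustrated with a minimal sketch. It assumes each model variant is summarized by a single dangerous-capability score in [0, 1]; the function name `safety_gap`, the size labels, and every number below are hypothetical placeholders, not the toolkit's API or results from the paper.

```python
def safety_gap(safeguarded_score: float, stripped_score: float) -> float:
    """Difference in dangerous-capability score between a model stripped of
    its safeguards and the same model with safeguards intact."""
    return stripped_score - safeguarded_score


# Hypothetical (safeguarded, stripped) scores for two model sizes.
# These values are made up for illustration only.
scores = {
    "small": (0.125, 0.25),
    "large": (0.25, 0.75),
}

gaps = {name: safety_gap(*pair) for name, pair in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: safety gap = {gap:.3f}")
```

A larger value means that removing safeguards unlocks more capability. Comparing the gap across sizes is how a scale trend like the one described in the abstract would show up in such a summary.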

Taxonomy

Topics: Safety Systems Engineering in Autonomy · Software Reliability and Analysis Research · Software Testing and Debugging Techniques