Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs

Krishiv Agarwal; Ramneet Kaur; Colin Samplawski; Manoj Acharya; Anirban Roy; Daniel Elenius; Brian Matejek; Adam D. Cobb; Susmit Jha

arXiv:2604.20945·cs.CR·April 24, 2026

TL;DR

This paper introduces interpretability-driven safety audits for state-of-the-art LLMs, revealing vulnerabilities and robustness differences through novel steering techniques and a systematic evaluation protocol.

Contribution

It presents a new interpretability-based approach for safety auditing of LLMs, including a two-stage grid search for activation steering and a comprehensive vulnerability assessment.
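The two-stage grid search can be illustrated with a minimal coarse-to-fine sketch. It is not the authors' algorithm, whose details are not given on this page. It assumes a single scalar steering coefficient and a scoring function that returns something like a judged jailbreak rate. The names `two_stage_grid_search` and `toy_score`, the grid sizes, and the search range are all hypothetical.

```python
from typing import Callable, List, Tuple


def linspace(lo: float, hi: float, n: int) -> List[float]:
    """Return n evenly spaced values from lo to hi inclusive (n >= 2)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def two_stage_grid_search(
    score: Callable[[float], float],
    lo: float,
    hi: float,
    coarse_points: int = 9,
    fine_points: int = 11,
) -> Tuple[float, float]:
    """Coarse-to-fine search for the coefficient that maximizes score.

    Returns (best_coefficient, best_score). Each candidate is scored once,
    since in practice one evaluation would mean generating and judging
    many model responses.
    """
    # Stage 1: coarse sweep over the full range.
    coarse = linspace(lo, hi, coarse_points)
    coarse_score, coarse_best = max((score(a), a) for a in coarse)

    # Stage 2: finer sweep within one coarse step on either side of the
    # stage-1 winner, clipped to the original range.
    step = (hi - lo) / (coarse_points - 1)
    fine = linspace(max(lo, coarse_best - step), min(hi, coarse_best + step), fine_points)
    fine_score, fine_best = max((score(a), a) for a in fine)

    # Keep whichever stage found the better coefficient.
    if fine_score >= coarse_score:
        return fine_best, fine_score
    return coarse_best, coarse_score


def toy_score(alpha: float) -> float:
    """Hypothetical stand-in for a judged jailbreak rate; peaks at alpha = 3.25."""
    return 1.0 - (alpha - 3.25) ** 2 / 100.0


best_alpha, best_score = two_stage_grid_search(toy_score, 0.0, 8.0)
print(best_alpha, best_score)
```

With the toy scorer, the coarse stage selects 3.0 and the fine stage moves to the nearest fine-grid point to the peak. A real audit would replace `toy_score` with a function that steers the model at the given coefficient, generates responses to the harmful-query set, and returns the judge's jailbreak rate.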

Findings

01. Llama-3 models are highly vulnerable, with up to 91% (Universal Steering) and 83% (Representation Engineering) jailbroken responses.

02. GPT-oss-120B remains robust to interpretability-based attacks.

03. Model robustness varies significantly across different architectures and sizes.

Abstract

Effective safety auditing of large language models (LLMs) demands tools that go beyond black-box probing and systematically uncover vulnerabilities rooted in model internals. We present a comprehensive, interpretability-driven jailbreaking audit of eight SOTA open-source LLMs: Llama-3.1-8B, Llama-3.3-70B-4bt, GPT-oss-20B, GPT-oss-120B, Qwen3-0.6B, Qwen3-32B, Phi4-3.8B, and Phi4-14B. Leveraging interpretability-based approaches -- Universal Steering (US) and Representation Engineering (RepE) -- we introduce an adaptive two-stage grid search algorithm to identify optimal activation-steering coefficients for unsafe behavioral concepts. Our evaluation, conducted on a curated set of harmful queries and a standardized LLM-based judging protocol, reveals stark contrasts in model robustness. The Llama-3 models are highly vulnerable, with up to 91% (US) and 83% (RepE) jailbroken responses on…
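To make the role of an activation-steering coefficient concrete, here is a minimal sketch of one common form of steering: a concept direction is added to a hidden state, scaled by the coefficient. This is an illustration under stated assumptions, not the paper's US or RepE procedure. The difference-of-means direction, the function names, and the toy activations are all hypothetical, and plain lists stand in for real model activations.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_of_means(concept: List[Vector], baseline: List[Vector]) -> Vector:
    """Assumed concept direction: mean concept activation minus mean baseline activation."""
    mean_c, mean_b = mean_vector(concept), mean_vector(baseline)
    return [c - b for c, b in zip(mean_c, mean_b)]


def steer(hidden: Vector, direction: Vector, coefficient: float) -> Vector:
    """Shift a hidden state along the direction: h' = h + coefficient * direction."""
    return [h + coefficient * d for h, d in zip(hidden, direction)]


# Toy activations (hypothetical): two examples per group, three dimensions each.
concept_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
baseline_acts = [[0.0, 1.0, 0.0], [2.0, 1.0, 0.0]]

direction = difference_of_means(concept_acts, baseline_acts)
steered = steer([0.5, 0.5, 0.5], direction, coefficient=2.0)
print(direction, steered)
```

A larger coefficient pushes the hidden state further along the concept direction. Searching over this coefficient is therefore a one-dimensional tuning problem per model, layer, and concept.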
