Improving Alignment and Robustness with Circuit Breakers
Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, Dan Hendrycks

TL;DR
This paper introduces 'circuit breakers', a novel method that intervenes in AI models' representations to prevent harmful outputs and improve robustness against adversarial attacks across text, multimodal systems, and AI agents.
Contribution
The paper presents a new representation-based intervention technique called circuit breakers that enhances AI safety and robustness without reducing utility, applicable to various AI modalities and agents.
Findings
Circuit breakers effectively prevent harmful outputs in text and multimodal models.
The approach preserves model utility while remaining robust to powerful unseen attacks.
Circuit breakers significantly reduce the rate of harmful actions taken by AI agents under attack.
Abstract
AI systems can take harmful actions and are highly vulnerable to adversarial attacks. We present an approach, inspired by recent advances in representation engineering, that interrupts the models as they respond with harmful outputs with "circuit breakers." Existing techniques aimed at improving alignment, such as refusal training, are often bypassed. Techniques such as adversarial training try to plug these holes by countering specific attacks. As an alternative to refusal training and adversarial training, circuit-breaking directly controls the representations that are responsible for harmful outputs in the first place. Our technique can be applied to both text-only and multimodal language models to prevent the generation of harmful outputs without sacrificing utility -- even in the presence of powerful unseen attacks. Notably, while adversarial robustness in standalone image…
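The idea of controlling the representations behind harmful outputs can be illustrated with a small loss sketch. This is not the authors' implementation: the abstract here does not specify the training objective, so the function names, the loss weights, and the pairing of a cosine-similarity penalty on harmful inputs with an L2 retention term on benign inputs are all assumptions made for illustration. Representations are plain lists of floats standing in for hidden states from a frozen reference model and from the model being tuned.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rerouting_loss(rep_frozen, rep_tuned):
    """Hypothetical penalty on a harmful input.

    Positive while the tuned representation still points in the same
    direction as the frozen one; zero once it is orthogonal or opposed.
    """
    return max(0.0, cosine_similarity(rep_frozen, rep_tuned))


def retain_loss(rep_frozen, rep_tuned):
    """Hypothetical penalty on a benign input: L2 distance from the frozen representation."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(rep_frozen, rep_tuned)))


def combined_loss(harmful_pair, benign_pair, c_reroute=1.0, c_retain=1.0):
    """Weighted sum of both terms; the weights are illustrative assumptions."""
    return (c_reroute * rerouting_loss(*harmful_pair)
            + c_retain * retain_loss(*benign_pair))
```

Minimizing the first term pushes harmful-input representations away from their original direction, while the second term keeps benign-input representations close to the original, which is one way to read the claim of robustness "without sacrificing utility".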
Code & Models
- 🤗 GraySwanAI/Llama-3-8B-Instruct-RR (model; 17k downloads, 15 likes)
- 🤗 GraySwanAI/Mistral-7B-Instruct-RR (model; 1.1k downloads, 5 likes)
- 🤗 RichardErkhov/GraySwanAI_-_Llama-3-8B-Instruct-RR-gguf (model; 136 downloads)
- 🤗 RichardErkhov/GraySwanAI_-_Mistral-7B-Instruct-RR-gguf (model; 38 downloads)
- 🤗 GraySwanAI/llava-v1.6-mistral-7b-hf-RR (model; 418 downloads, 1 like)
- 🤗 RichardErkhov/GraySwanAI_-_Mistral-7B-Instruct-RR-8bits (model; 1 download)
- 🤗 RichardErkhov/GraySwanAI_-_Llama-3-8B-Instruct-RR-8bits (model; 1 download)
- 🤗 RichardErkhov/GraySwanAI_-_Llama-3-8B-Instruct-RR-awq (model; 1 download)
- 🤗 WhyTheMoon/Llama-3-8B-Instruct_RR_Textbook-Bio (model)
- 🤗 WhyTheMoon/Llama-3-8B-Instruct_RR_Filter-Bio (model; 1 download)
Taxonomy
Topics: VLSI and FPGA Design Techniques
