Improving Alignment and Robustness with Circuit Breakers

Andy Zou; Long Phan; Justin Wang; Derek Duenas; Maxwell Lin; Maksym Andriushchenko; Rowan Wang; Zico Kolter; Matt Fredrikson; Dan Hendrycks

arXiv:2406.04313 · cs.LG · July 15, 2024 · 6 cites

TL;DR

This paper introduces 'circuit breakers', a novel method that intervenes in AI models' representations to prevent harmful outputs and improve robustness against adversarial attacks across text, multimodal systems, and AI agents.

Contribution

The paper presents a new representation-based intervention technique called circuit breakers that enhances AI safety and robustness without reducing utility, applicable to various AI modalities and agents.

Findings

01 Circuit breakers effectively prevent harmful outputs in text and multimodal models.

02 The approach maintains model utility even under unseen attacks.

03 Circuit breakers significantly reduce harmful actions by AI agents under attack.

Abstract

AI systems can take harmful actions and are highly vulnerable to adversarial attacks. We present an approach, inspired by recent advances in representation engineering, that interrupts the models as they respond with harmful outputs with "circuit breakers." Existing techniques aimed at improving alignment, such as refusal training, are often bypassed. Techniques such as adversarial training try to plug these holes by countering specific attacks. As an alternative to refusal training and adversarial training, circuit-breaking directly controls the representations that are responsible for harmful outputs in the first place. Our technique can be applied to both text-only and multimodal language models to prevent the generation of harmful outputs without sacrificing utility -- even in the presence of powerful unseen attacks. Notably, while adversarial robustness in standalone image…
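The abstract says that circuit-breaking "directly controls the representations that are responsible for harmful outputs" while preserving utility, but this summary does not give the training objective. The sketch below is only one hypothetical way such an objective could look, under two assumptions that are not confirmed by the text above: representations on harmful inputs are pushed away from their original direction (a clipped cosine similarity), and representations on benign inputs are kept close to their original values (an L2 distance). The function names and the `alpha` weight are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rerouting_loss(original_rep, current_rep):
    """Assumed 'circuit breaker' term for harmful inputs.

    Penalizes the current representation for pointing in the same direction
    as the original one; the penalty is zero once the two are orthogonal
    or opposed.
    """
    return max(0.0, cosine_similarity(original_rep, current_rep))


def retain_loss(original_rep, current_rep):
    """Assumed utility term for benign inputs: L2 distance to the original."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original_rep, current_rep)))


def combined_loss(harmful_pair, benign_pair, alpha=0.5):
    """Weighted sum of both terms; alpha is a hypothetical trade-off weight."""
    return (alpha * rerouting_loss(*harmful_pair)
            + (1.0 - alpha) * retain_loss(*benign_pair))


if __name__ == "__main__":
    # Harmful input: representation unchanged, so the rerouting penalty is maximal.
    unchanged = ([1.0, 0.0], [1.0, 0.0])
    # Harmful input: representation rotated to be orthogonal, so no penalty.
    rerouted = ([1.0, 0.0], [0.0, 1.0])
    # Benign input: representation preserved exactly, so no retain penalty.
    preserved = ([0.5, 0.5], [0.5, 0.5])
    print(combined_loss(unchanged, preserved))
    print(combined_loss(rerouted, preserved))
```

In this toy setup the first term only shrinks when the model's internal state on harmful inputs moves away from where it started, and the second term only shrinks when benign behavior is left alone, which mirrors the abstract's claim of blocking harmful outputs "without sacrificing utility".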

Code & Models

Videos

Improving Alignment and Robustness with Circuit Breakers · SlidesLive

Taxonomy

Topics: VLSI and FPGA Design Techniques