Revisiting the Robust Alignment of Circuit Breakers

Leo Schwinn; Simon Geisler

arXiv:2407.15902·cs.CR·August 5, 2024

TL;DR

This paper re-evaluates the robustness of the circuit breaker defense for aligning large language models and shows that its reported robustness against unconstrained continuous attacks in the embedding space may be overestimated.

Contribution

It demonstrates that a few simple changes to existing embedding space attacks are enough to reach a 100% attack success rate against circuit breaker models, challenging the earlier robustness claims and highlighting the need for more rigorous evaluation.

Findings

01

Achieved 100% attack success rate against circuit breaker models.

02

Increased attack success rate by more than 80% without further hyperparameter tuning.

03

Questioned the robustness claims of circuit breakers in LLM alignment.

Abstract

Over the past decade, adversarial training has emerged as one of the few reliable methods for enhancing model robustness against adversarial attacks [Szegedy et al., 2014, Madry et al., 2018, Xhonneux et al., 2024], while many alternative approaches have failed to withstand rigorous subsequent evaluations. Recently, an alternative defense mechanism, namely "circuit breakers" [Zou et al., 2024], has shown promising results for aligning LLMs. In this report, we show that the robustness claims of "Improving Alignment and Robustness with Circuit Breakers" [Zou et al., 2024] against unconstrained continuous attacks in the embedding space of the input tokens may be overestimated. Specifically, we demonstrate that by implementing a few simple changes to embedding space attacks [Schwinn et al., 2024a,b], we achieve 100% attack success rate (ASR) against circuit breaker models. Without conducting…
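To make the idea of an unconstrained continuous attack in embedding space concrete, here is a minimal toy sketch. It is not the authors' attack and not the circuit breaker setup: a tiny linear softmax classifier stands in for the LLM, a single vector `x` stands in for the input token embeddings, and plain gradient descent is used as the optimizer. All names (`embedding_attack`, `loss_and_grad`, `attack_success_rate`, the matrix `W`) are hypothetical and chosen only for illustration. The point it illustrates is that the attacker optimizes the continuous input directly, with no projection back onto valid tokens and no norm bound, until the model produces the attacker's target output.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(W, x):
    # Toy stand-in for a model: class probabilities from logits = W x.
    return softmax([sum(w * xj for w, xj in zip(row, x)) for row in W])

def loss_and_grad(W, x, target):
    # Cross-entropy of the target class and its gradient w.r.t. the input x.
    probs = forward(W, x)
    loss = -math.log(probs[target])
    # d loss / d logits = probs - onehot(target); chain rule through logits = W x.
    dlogits = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    grad = [sum(dlogits[i] * W[i][j] for i in range(len(W))) for j in range(len(x))]
    return loss, grad

def embedding_attack(W, x0, target, steps=200, lr=0.5):
    # Unconstrained attack: gradient descent directly on the continuous input,
    # with no projection step and no bound on how far x may move from x0.
    x = list(x0)
    for _ in range(steps):
        _, grad = loss_and_grad(W, x, target)
        x = [xj - lr * gj for xj, gj in zip(x, grad)]
    return x

def predict(W, x):
    probs = forward(W, x)
    return max(range(len(probs)), key=probs.__getitem__)

def attack_success_rate(outcomes):
    # ASR: fraction of attacked inputs for which the attack reached its goal.
    return sum(1 for ok in outcomes if ok) / len(outcomes)

# Hypothetical toy data: 3 output classes, 2-dimensional "embedding".
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x0 = [2.0, 0.0]   # the clean input is classified as class 0
target = 1        # the attacker wants class 1 instead
x_adv = embedding_attack(W, x0, target)
print("prediction before:", predict(W, x0), "after:", predict(W, x_adv))
print("ASR on this single example:", attack_success_rate([predict(W, x_adv) == target]))
```

In a real evaluation the same loop would run over the embeddings of an actual language model, and success would be judged on the generated text rather than on a single class label, so this sketch only conveys the optimization pattern and the ASR bookkeeping.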

Code & Models

Repositories

schwinnl/circuit-breakers-eval (PyTorch, official)
