Bergeron: Combating Adversarial Attacks through a Conscience-Based Alignment Framework

Matthew Pisano; Peter Ly; Abraham Sanders; Bingsheng Yao; Dakuo Wang; Tomek Strzalkowski; Mei Si

arXiv:2312.00029 · cs.CR · August 20, 2024 · 5 cites

TL;DR

Bergeron is a novel framework that enhances the robustness of large language models against adversarial attacks by employing a guardian model, significantly reducing harmful outputs without additional fine-tuning.

Contribution

The paper introduces Bergeron, a two-tier alignment framework with a guardian model that improves LLM safety and robustness against attacks without extra parameter tuning.

Findings

01. Models with Bergeron are nearly seven times more resistant to attacks.

02. Bergeron effectively monitors and mitigates harmful outputs in LLMs.

03. The framework enhances safety of both commercial and open-source LLMs.

Abstract

Research into AI alignment has grown considerably since the recent introduction of increasingly capable Large Language Models (LLMs). Unfortunately, modern methods of alignment still fail to fully prevent harmful responses when models are deliberately attacked. Such vulnerabilities can lead to LLMs being manipulated into generating hazardous content, from instructions for creating dangerous materials to inciting violence or endorsing unethical behaviors. To help mitigate this issue, we introduce Bergeron: a framework designed to improve the robustness of LLMs against attacks without any additional parameter fine-tuning. Bergeron is organized into two tiers, with a secondary LLM acting as a guardian to the primary LLM. This framework better safeguards the primary model against incoming attacks while monitoring its output for any harmful content. Empirical analysis reveals that by using…
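
The two-tier arrangement described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `guarded_generate`, `toy_primary`, `toy_guardian`, and `REFUSAL` are hypothetical, the guardian is reduced to a yes/no check, and returning a fixed refusal when something is flagged is an assumption, since the abstract does not say how flagged content is handled. The sketch only shows the control flow: the guardian inspects the incoming prompt, then inspects the primary model's output, and neither model's parameters are changed.

```python
from typing import Callable

# Hypothetical type: any function mapping a prompt string to a response string.
LLM = Callable[[str], str]

# Placeholder refusal text (an assumption, not taken from the paper).
REFUSAL = "I can't help with that request."


def guarded_generate(prompt: str, primary: LLM,
                     is_unsafe: Callable[[str], bool]) -> str:
    """Wrap a primary model with a guardian check on input and output."""
    # The guardian screens the incoming prompt before the primary sees it.
    if is_unsafe(prompt):
        return REFUSAL
    response = primary(prompt)
    # The guardian monitors the primary model's output for harmful content.
    if is_unsafe(response):
        return REFUSAL
    return response


# Toy stand-ins for real models, used only to exercise the control flow.
def toy_primary(prompt: str) -> str:
    return f"Echo: {prompt}"


def toy_guardian(text: str) -> bool:
    # A keyword check stands in for a secondary LLM's judgment.
    return "explosive" in text.lower()
```

In practice the guardian would itself be an LLM call rather than a keyword match; the wrapper shape stays the same because both tiers are passed in as plain callables.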

Code & Models

Repositories

matthew-pisano/Bergeron (Official)

Taxonomy

Topics: Advanced Malware Detection Techniques