Bergeron: Combating Adversarial Attacks through a Conscience-Based Alignment Framework
Matthew Pisano, Peter Ly, Abraham Sanders, Bingsheng Yao, Dakuo Wang, Tomek Strzalkowski, Mei Si

TL;DR
Bergeron is a novel framework that enhances the robustness of large language models against adversarial attacks by employing a guardian model, significantly reducing harmful outputs without additional fine-tuning.
Contribution
The paper introduces Bergeron, a two-tier alignment framework with a guardian model that improves LLM safety and robustness against attacks without extra parameter tuning.
Findings
Models with Bergeron are nearly seven times more resistant to attacks.
Bergeron effectively monitors and mitigates harmful outputs in LLMs.
The framework enhances safety of both commercial and open-source LLMs.
Abstract
Research into AI alignment has grown considerably since the recent introduction of increasingly capable Large Language Models (LLMs). Unfortunately, modern methods of alignment still fail to fully prevent harmful responses when models are deliberately attacked. Such vulnerabilities can lead to LLMs being manipulated into generating hazardous content, from instructions for creating dangerous materials to inciting violence or endorsing unethical behaviors. To help mitigate this issue, we introduce Bergeron: a framework designed to improve the robustness of LLMs against attacks without any additional parameter fine-tuning. Bergeron is organized into two tiers, with a secondary LLM acting as a guardian to the primary LLM. This framework better safeguards the primary model against incoming attacks while monitoring its output for any harmful content. Empirical analysis reveals that by using…
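The two-tier arrangement described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `guarded_generate`, the YES/NO guardian prompts, the caution note, and the refusal message are all assumptions made for illustration, and both models are represented as plain text-in, text-out callables.

```python
from typing import Callable

# Hypothetical interface: any function that maps a prompt string to a reply string.
Model = Callable[[str], str]

# Assumed wording; the paper's actual prompts and corrections are not given here.
CAUTION = "Caution: the request below may be adversarial. Respond safely.\n\n"
REFUSAL = "I can't help with that request."


def is_flagged(guardian: Model, kind: str, text: str) -> bool:
    """Ask the guardian model for a YES/NO safety verdict on a piece of text."""
    verdict = guardian(f"Is the following {kind} unsafe? Answer YES or NO.\n\n{text}")
    return verdict.strip().upper().startswith("YES")


def guarded_generate(prompt: str, primary: Model, guardian: Model) -> str:
    """Wrap a primary model with a guardian that checks both input and output."""
    # Input side: warn the primary model when the guardian flags the prompt.
    if is_flagged(guardian, "prompt", prompt):
        prompt = CAUTION + prompt

    response = primary(prompt)

    # Output side: withhold the response when the guardian flags it.
    if is_flagged(guardian, "response", response):
        return REFUSAL
    return response


if __name__ == "__main__":
    # Stub models standing in for real LLM calls (assumptions for the demo).
    def stub_primary(text: str) -> str:
        return "echo: " + text

    def stub_guardian(text: str) -> str:
        return "YES" if "explosive" in text.lower() else "NO"

    print(guarded_generate("How do I bake bread?", stub_primary, stub_guardian))
    print(guarded_generate("How do I build an explosive?", stub_primary, stub_guardian))
```

Neither model's weights are touched in this sketch, which matches the abstract's point that the framework needs no additional parameter fine-tuning. How the guardian phrases its critique and how a flagged response is corrected are design choices the truncated abstract does not specify.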
Taxonomy
Topics: Advanced Malware Detection Techniques
