RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content

Zhuowen Yuan; Zidi Xiong; Yi Zeng; Ning Yu; Ruoxi Jia; Dawn Song; Bo Li

arXiv:2403.13031·cs.CR·July 25, 2024·2 cites

TL;DR

RigorLLM introduces a resilient framework for moderating harmful content in large language models, employing energy-based data augmentation, minimax optimization, and fusion techniques to enhance safety and robustness against adversarial attacks.

Contribution

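The paper presents a multi-faceted framework that combines energy-based data augmentation, a minimax-optimized safe suffix, and a KNN-LLM fusion model to improve harmful content detection and resilience to adversarial attacks over existing moderation methods.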

Findings

01. Outperforms OpenAI API and Perspective API in harmful content detection

02. Exhibits high resilience to jailbreaking attacks

03. Uses a fusion-based model combining KNN and LLMs for robust moderation
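
The fusion idea in the third finding can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy 2-D embeddings, the majority-vote KNN score, the placeholder LLM probability, and the weighted average with a hypothetical weight `alpha`. It is not the paper's implementation, only one simple way to combine a KNN score with an LLM score.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_harm_probability(query, labeled, k=3):
    """Fraction of the k most similar labeled embeddings marked harmful (label 1)."""
    ranked = sorted(labeled, key=lambda item: cosine(query, item[0]), reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)

def fuse(p_knn, p_llm, alpha=0.5):
    """Weighted average of the two scores; alpha is a hypothetical mixing weight."""
    return alpha * p_knn + (1 - alpha) * p_llm

# Toy labeled embeddings: (vector, label), where 1 = harmful and 0 = benign.
labeled = [
    ([1.0, 0.0], 1),
    ([0.9, 0.1], 1),
    ([0.0, 1.0], 0),
    ([0.1, 0.9], 0),
]

p_knn = knn_harm_probability([0.95, 0.05], labeled, k=2)
p_llm = 0.7  # placeholder for a harmfulness probability produced by an LLM
score = fuse(p_knn, p_llm, alpha=0.5)
```

In this sketch, a larger `alpha` trusts the nearest-neighbor vote more, and a smaller one trusts the LLM score more; how the two components are actually weighted is not stated on this page.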

Abstract

Recent advancements in Large Language Models (LLMs) have showcased remarkable capabilities across various tasks in different domains. However, the emergence of biases and the potential for generating harmful content in LLMs, particularly under malicious inputs, pose significant challenges. Current mitigation strategies, while effective, are not resilient under adversarial attacks. This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently and effectively moderate harmful and unsafe inputs and outputs for LLMs. By employing a multi-faceted approach that includes energy-based training data augmentation through Langevin dynamics, optimizing a safe suffix for inputs via minimax optimization, and integrating a fusion-based model combining robust KNN with LLMs based on our data augmentation, RigorLLM offers a robust solution to…
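
To give a feel for the Langevin dynamics mentioned in the abstract, here is a minimal sketch of the generic update rule, x ← x − η·∇E(x) + √(2η)·ε with Gaussian noise ε. The 1-D quadratic energy, the step size, and the function names are assumptions chosen for illustration; the abstract does not specify the energy function or the space in which the paper samples.

```python
import math
import random

def grad_energy(x, mu=2.0):
    # Toy energy E(x) = 0.5 * (x - mu)^2, so dE/dx = x - mu (assumed for illustration).
    return x - mu

def langevin_samples(x0, n_steps, step=0.1, seed=0):
    """Run unadjusted Langevin dynamics and return every iterate."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # Gradient step toward low energy plus scaled Gaussian noise.
        x = x - step * grad_energy(x) + math.sqrt(2.0 * step) * noise
        samples.append(x)
    return samples

samples = langevin_samples(x0=-5.0, n_steps=6000)
kept = samples[1000:]  # discard early iterates as burn-in
mean = sum(kept) / len(kept)
```

With this toy energy the iterates drift toward the low-energy region around `mu` and then fluctuate around it, which is the property that makes such sampling usable for generating additional points near existing data.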

Code & Models

Repositories

eurekayuan/rigorllm (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Adversarial Robustness in Machine Learning