RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content
Zhuowen Yuan, Zidi Xiong, Yi Zeng, Ning Yu, Ruoxi Jia, Dawn Song, Bo Li

TL;DR
RigorLLM introduces a resilient framework for moderating harmful content in large language models, employing energy-based data augmentation, minimax optimization, and fusion techniques to enhance safety and robustness against adversarial attacks.
Contribution
The paper presents a novel multi-faceted framework that significantly improves harmful content detection and resilience in LLMs compared to existing methods.
Findings
Outperforms OpenAI API and Perspective API in harmful content detection
Exhibits high resilience to jailbreaking attacks
Uses a fusion-based model combining KNN and LLMs for robust moderation
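The fusion idea in the last finding can be illustrated with a minimal sketch. It assumes, for illustration only, that the KNN component votes over labeled embeddings by cosine similarity and that the two scores are combined by a weighted average. The function names, the weight `alpha`, and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_harm_probability(query, examples, k=3):
    """Fraction of the k most similar labeled examples that are harmful.

    `examples` is a list of (embedding, is_harmful) pairs (hypothetical format).
    """
    if not examples:
        raise ValueError("need at least one labeled example")
    ranked = sorted(examples, key=lambda ex: cosine(query, ex[0]), reverse=True)
    top = ranked[:k]
    return sum(1 for _, harmful in top if harmful) / len(top)


def fused_score(knn_prob, llm_prob, alpha=0.5):
    """Weighted average of the two scores; alpha is an assumed tuning knob."""
    return alpha * knn_prob + (1 - alpha) * llm_prob


# Toy 2-D "embeddings"; a real system would use a text encoder.
examples = [
    ([1.0, 0.0], True),
    ([0.9, 0.1], True),
    ([0.0, 1.0], False),
    ([0.1, 0.9], False),
]
knn_prob = knn_harm_probability([0.95, 0.05], examples, k=2)
# 0.4 stands in for a harmfulness probability returned by an LLM classifier.
score = fused_score(knn_prob, llm_prob=0.4, alpha=0.5)
```

A weighted average keeps the nearest-neighbor signal in play even when the LLM score is pushed down by an adversarial prompt, which is the intuition behind combining the two components.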
Abstract
Recent advancements in Large Language Models (LLMs) have showcased remarkable capabilities across various tasks in different domains. However, the emergence of biases and the potential for generating harmful content in LLMs, particularly under malicious inputs, pose significant challenges. Current mitigation strategies, while effective, are not resilient under adversarial attacks. This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently and effectively moderate harmful and unsafe inputs and outputs for LLMs. By employing a multi-faceted approach that includes energy-based training data augmentation through Langevin dynamics, optimizing a safe suffix for inputs via minimax optimization, and integrating a fusion-based model combining robust KNN with LLMs based on our data augmentation, RigorLLM offers a robust solution to…
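The abstract mentions energy-based data augmentation through Langevin dynamics without giving the energy function in this summary. As a rough illustration of the sampling step alone, the sketch below runs the generic Langevin update (a gradient step on the energy plus Gaussian noise) on a toy quadratic energy. The energy, the step size, and all names are assumptions chosen for illustration, not the paper's formulation.

```python
import math
import random


def energy_grad(x, centre):
    """Gradient of the toy energy E(x) = 0.5 * ||x - centre||^2 (an assumption)."""
    return [xi - ci for xi, ci in zip(x, centre)]


def langevin_sample(x0, centre, step=0.1, n_steps=200, noise_scale=1.0, rng=None):
    """Iterate x <- x - step * grad E(x) + noise_scale * sqrt(2 * step) * eps.

    eps is standard Gaussian noise; noise_scale=0 reduces to gradient descent.
    """
    rng = rng or random.Random(0)
    x = list(x0)
    for _ in range(n_steps):
        grad = energy_grad(x, centre)
        x = [
            xi - step * gi + noise_scale * math.sqrt(2 * step) * rng.gauss(0.0, 1.0)
            for xi, gi in zip(x, grad)
        ]
    return x


# Draw a few augmented points around a hypothetical "harmful example" embedding.
seed_embedding = [1.0, 2.0]
augmented = [
    langevin_sample([0.0, 0.0], seed_embedding, rng=random.Random(i))
    for i in range(3)
]
```

In an augmentation setting, low-energy regions would correspond to embeddings close to known harmful examples, so samples drawn this way would act as additional training points near them.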
Code & Models
- 🤗 meta-llama/Meta-Llama-Guard-2-8B · model · 7.1k dl · ♡ 307
- 🤗 QuantFactory/Meta-Llama-Guard-2-8B-GGUF · model · 300 dl · ♡ 12
- 🤗 RichardErkhov/meta-llama_-_Meta-Llama-Guard-2-8B-4bits · model · 5 dl
- 🤗 RichardErkhov/meta-llama_-_Meta-Llama-Guard-2-8B-8bits · model · 8 dl
- 🤗 Efficient-Large-Model/Meta-Llama-Guard-2-8B · model · 3 dl
- 🤗 tybrs/llama-guard-quant · model · 228 dl
- 🤗 LiteLLMs/Meta-Llama-Guard-2-8B-GGUF · model · 5 dl
- 🤗 RichardErkhov/meta-llama_-_Meta-Llama-Guard-2-8B-gguf · model · 1.5k dl
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Adversarial Robustness in Machine Learning
