AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts

Shaona Ghosh; Prasoon Varshney; Erick Galinkin; Christopher Parisien

arXiv:2404.05993 · cs.LG · September 12, 2024 · 3 cites

TL;DR

This paper introduces AEGIS, an online adaptive content moderation system utilizing an ensemble of LLM experts, supported by a new safety dataset, to improve safety performance and robustness in AI-generated content.

Contribution

It defines a comprehensive safety taxonomy, curates a large safety dataset, and develops an adaptive ensemble moderation framework with theoretical guarantees.
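
The summary above does not spell out the adaptation rule. As one way to picture an online adaptive ensemble, the sketch below uses the generic Hedge (exponential weights) update, a standard no-regret method, over binary safe/unsafe votes. The class name `HedgeEnsemble`, the learning rate `eta`, the 0/1 loss, and the toy votes are all assumptions made for illustration and should not be read as the paper's implementation.

```python
import math
from typing import List, Sequence


class HedgeEnsemble:
    """Generic Hedge (exponential weights) over binary safety experts.

    Illustrative only: the class, its parameters, and the 0/1 loss are
    assumptions, not the paper's implementation.
    """

    def __init__(self, n_experts: int, eta: float = 0.5) -> None:
        self.eta = eta  # learning rate (assumed value)
        self.weights: List[float] = [1.0 / n_experts] * n_experts

    def predict(self, votes: Sequence[int]) -> int:
        """Weighted majority vote; each vote is 1 (unsafe) or 0 (safe)."""
        unsafe_mass = sum(w for w, v in zip(self.weights, votes) if v == 1)
        return 1 if unsafe_mass >= 0.5 else 0

    def update(self, votes: Sequence[int], label: int) -> None:
        """Down-weight experts that disagreed with the revealed label."""
        scaled = [
            w * math.exp(-self.eta * (1 if v != label else 0))
            for w, v in zip(self.weights, votes)
        ]
        total = sum(scaled)
        self.weights = [s / total for s in scaled]


def run_demo() -> List[float]:
    """Feed three toy rounds of (expert votes, true label) to the ensemble."""
    ensemble = HedgeEnsemble(n_experts=3)
    rounds = [((1, 0, 1), 1), ((0, 0, 1), 0), ((1, 1, 0), 1)]
    for votes, label in rounds:
        ensemble.predict(votes)        # decision made before feedback
        ensemble.update(votes, label)  # feedback arrives, weights adapt
    return ensemble.weights
```

In this toy stream the first hypothetical expert is never wrong, so its weight ends up the largest. A real deployment would replace the hard-coded votes with the verdicts of actual safety models and would need a source of feedback labels.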

Findings

01. AEGISSAFETYEXPERTS outperform state-of-the-art safety models

02. The dataset enhances safety model training without harming MT Bench scores

03. The online adaptation framework improves robustness against jailbreak attacks

Abstract

As Large Language Models (LLMs) and generative AI become more widespread, the content safety risks associated with their use also increase. We find a notable deficiency in high-quality content safety datasets and benchmarks that comprehensively cover a wide range of critical safety areas. To address this, we define a broad content safety risk taxonomy, comprising 13 critical risk and 9 sparse risk categories. Additionally, we curate AEGISSAFETYDATASET, a new dataset of approximately 26,000 human-LLM interaction instances, complete with human annotations adhering to the taxonomy. We plan to release this dataset to the community to further research and to help benchmark LLMs for safety. To demonstrate the effectiveness of the dataset, we instruction-tune multiple LLM-based safety models. We show that our models (named AEGISSAFETYEXPERTS) not only surpass or perform competitively…

Code & Models

Datasets

nvidia/Aegis-AI-Content-Safety-Dataset-1.0
dataset · 868 dl

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Web Application Security Vulnerabilities