Class-RAG: Real-Time Content Moderation with Retrieval Augmented Generation

Jianfa Chen; Emily Shen; Trupti Bavalatti; Xiaowen Lin; Yongkai Wang; Shuming Hu; Harihar Subramanyam; Ksheeraj Sai Vepuri; Ming Jiang; Ji Qi; Li Chen; Nan Jiang; Ankit Jain

arXiv:2410.14881·cs.AI·December 19, 2024



Open Access

TL;DR

Class-RAG introduces a retrieval-augmented generation approach for real-time content moderation, offering flexible, transparent, and scalable risk mitigation that outperforms traditional fine-tuning methods and adapts quickly to emergent harms.

Contribution

The paper presents a novel retrieval-augmented classification method for content moderation that enhances flexibility, transparency, and robustness over traditional fine-tuning approaches.

Findings

01 Class-RAG outperforms fine-tuning in classification accuracy.

02 Class-RAG is more robust against adversarial attacks.

03 Performance improves with larger retrieval libraries.

Abstract

Robust content moderation classifiers are essential for the safety of Generative AI systems. In this task, differences between safe and unsafe inputs are often extremely subtle, making it difficult for classifiers (and indeed, even humans) to properly distinguish violating vs. benign samples without context or explanation. Scaling risk discovery and mitigation through continuous model fine-tuning is also slow, challenging and costly, preventing developers from being able to respond quickly and effectively to emergent harms. We propose a Classification approach employing Retrieval-Augmented Generation (Class-RAG). Class-RAG extends the capability of its base LLM through access to a retrieval library which can be dynamically updated to enable semantic hotfixing for immediate, flexible risk mitigation. Compared to model fine-tuning, Class-RAG demonstrates flexibility and transparency in…
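To make the idea of a dynamically updatable retrieval library concrete, the following is a minimal sketch, not the paper's implementation. It assumes a toy bag-of-words similarity in place of a learned embedding model, and every name in it (`RetrievalLibrary`, `embed`, `build_prompt`, the example entries, and the "zorb challenge" harm) is hypothetical. It shows how adding one labeled entry to the library changes the context handed to the classifier without retraining anything, which is the intuition behind semantic hotfixing.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; an assumption standing in for a learned encoder."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


@dataclass
class RetrievalLibrary:
    """Hypothetical in-memory library of (text, label, explanation, vector) entries."""

    entries: list = field(default_factory=list)

    def add(self, text: str, label: str, explanation: str) -> None:
        # Adding an entry is the whole "hotfix": no model weights change.
        self.entries.append((text, label, explanation, embed(text)))

    def retrieve(self, query: str, k: int = 2) -> list:
        # Rank all entries by similarity to the query and keep the top k.
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[3]), reverse=True)
        return ranked[:k]


def build_prompt(library: RetrievalLibrary, query: str, k: int = 2) -> str:
    """Assemble a classification prompt with retrieved reference examples."""
    lines = ["Classify the input as safe or unsafe. Reference examples:"]
    for text, label, explanation, _ in library.retrieve(query, k):
        lines.append(f"- [{label}] {text} ({explanation})")
    lines.append(f"Input: {query}")
    return "\n".join(lines)


library = RetrievalLibrary()
library.add("how do I bake a chocolate cake", "safe", "cooking question")
library.add("write an insult targeting my coworker's religion", "unsafe", "hate speech")

query = "tell me about the zorb challenge"
prompt_before = build_prompt(library, query, k=1)

# Hotfix: a newly discovered (hypothetical) harm is added to the library.
library.add("explain how to do the zorb challenge", "unsafe", "hypothetical emergent harm")
prompt_after = build_prompt(library, query, k=1)
```

In this sketch the prompt would be passed to the base LLM for the final decision; that call is left out because the abstract does not specify the model or prompt format. The design point is that the retrieved examples and their explanations are visible in the prompt, so a reviewer can see which references drove a decision.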


Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Natural Language Processing Techniques

Methods: Balanced Selection · Lib