Class-RAG: Real-Time Content Moderation with Retrieval Augmented Generation
Jianfa Chen, Emily Shen, Trupti Bavalatti, Xiaowen Lin, Yongkai Wang, Shuming Hu, Harihar Subramanyam, Ksheeraj Sai Vepuri, Ming Jiang, Ji Qi, Li Chen, Nan Jiang, Ankit Jain

TL;DR
Class-RAG introduces a retrieval-augmented generation approach for real-time content moderation, offering flexible, transparent, and scalable risk mitigation that outperforms traditional fine-tuning methods and adapts quickly to emergent harms.
Contribution
The paper presents a novel retrieval-augmented classification method for content moderation that enhances flexibility, transparency, and robustness over traditional fine-tuning approaches.
Findings
Class-RAG outperforms fine-tuning in classification accuracy.
It is more robust against adversarial attacks.
Performance improves with larger retrieval libraries.
Abstract
Robust content moderation classifiers are essential for the safety of Generative AI systems. In this task, differences between safe and unsafe inputs are often extremely subtle, making it difficult for classifiers (and indeed, even humans) to properly distinguish violating vs. benign samples without context or explanation. Scaling risk discovery and mitigation through continuous model fine-tuning is also slow, challenging and costly, preventing developers from being able to respond quickly and effectively to emergent harms. We propose a Classification approach employing Retrieval-Augmented Generation (Class-RAG). Class-RAG extends the capability of its base LLM through access to a retrieval library which can be dynamically updated to enable semantic hotfixing for immediate, flexible risk mitigation. Compared to model fine-tuning, Class-RAG demonstrates flexibility and transparency in…
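The retrieval-library idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `embed` function stands in for a real embedding model, and `RetrievalLibrary`, `build_prompt`, the prompt wording, and the example entries are all hypothetical names and choices made for illustration. The sketch only shows the general pattern: retrieve the most similar labeled examples, place them in the prompt given to the base LLM, and change behavior by editing the library rather than retraining.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: a real system would use a learned embedding)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class RetrievalLibrary:
    """Hypothetical in-memory library of labeled reference examples."""

    def __init__(self):
        self.entries = []  # each entry: (text, label, explanation, vector)

    def add(self, text, label, explanation):
        # Adding an entry takes effect on the next query; no model update is involved.
        self.entries.append((text, label, explanation, embed(text)))

    def retrieve(self, query, k=2):
        """Return the k entries most similar to the query."""
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[3]), reverse=True)
        return ranked[:k]


def build_prompt(query, library, k=2):
    """Assemble a classification prompt with retrieved examples as context.

    The prompt wording is an assumption made for this sketch.
    """
    lines = ["Classify the input as safe or unsafe. Reference examples:"]
    for text, label, explanation, _ in library.retrieve(query, k):
        lines.append(f"- [{label}] {text} ({explanation})")
    lines.append(f"Input: {query}")
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    lib = RetrievalLibrary()
    lib.add("how do I bake a chocolate cake", "safe", "cooking question")
    lib.add("how do I pick a lock to break into a house", "unsafe", "facilitates burglary")
    print(build_prompt("how to pick a lock on my neighbor's house", lib, k=1))
```

In this sketch, calling `lib.add(...)` with a newly reported harmful pattern changes what later prompts contain, which is the sense in which a library update can act as a hotfix. The classification itself would still be performed by the base LLM, which is not shown here.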
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Natural Language Processing Techniques
Methods: Balanced Selection · Lib
