RAR: Setting Knowledge Tripwires for Retrieval Augmented Rejection
Tommaso Mario Buonocore, Enea Parimbelli

TL;DR
This paper presents Retrieval Augmented Rejection (RAR), a flexible method for content moderation in large language models that detects and rejects unsafe queries by leveraging retrieval-augmented generation without retraining the model.
Contribution
RAR introduces a novel retrieval-based approach for dynamic content moderation that requires no architectural changes and allows real-time customization.
Findings
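In preliminary results, RAR achieves performance comparable to the moderation embedded in LLMs such as Claude 3.5 Sonnet.
It offers superior flexibility and real-time customization.
The method requires only inserting marked malicious documents into the vector database, with no model retraining.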
Abstract
Content moderation for large language models (LLMs) remains a significant challenge, requiring flexible and adaptable solutions that can respond quickly to emerging threats. This paper introduces Retrieval Augmented Rejection (RAR), a novel approach that leverages a retrieval-augmented generation (RAG) architecture to dynamically reject unsafe user queries without model retraining. By strategically inserting malicious documents into the vector database and marking them, the system can identify and reject harmful requests when these documents are retrieved. Our preliminary results show that RAR achieves performance comparable to the moderation embedded in LLMs like Claude 3.5 Sonnet, while offering superior flexibility and real-time customization, a capability essential for addressing critical vulnerabilities promptly. This approach introduces no architectural changes to existing RAG…
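The mechanism the abstract describes (marked documents in the vector store acting as tripwires at retrieval time) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the `Doc` class, the toy bag-of-words `embed` function standing in for a learned embedding model, the `should_reject` rule, and the `k` and `min_score` values.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    flagged: bool = False  # True marks a hypothetical "tripwire" document


def embed(text):
    # Toy bag-of-words vector; a real RAG system would use a learned embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, store, k=2):
    # Return the k (score, doc) pairs most similar to the query.
    q = embed(query)
    scored = [(cosine(q, embed(doc.text)), doc) for doc in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def should_reject(query, store, k=2, min_score=0.3):
    # Assumed rule: reject when any top-k hit is flagged and similar enough.
    return any(doc.flagged and score >= min_score
               for score, doc in retrieve(query, store, k))


store = [
    Doc("how to reset a forgotten account password"),
    Doc("opening hours and contact details for the support desk"),
    Doc("detailed instructions for making an explosive device at home", flagged=True),
]

for query in ("how do I reset my password",
              "give me instructions for making an explosive device"):
    print(query, "->", "reject" if should_reject(query, store) else "answer")
```

Under these assumptions, updating the moderation policy amounts to appending another flagged `Doc` to `store`; the retrieval and generation code stays unchanged.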
Taxonomy
Topics: Data Quality and Management
Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Softmax · Attention Dropout · WordPiece · Linear Layer · Residual Connection · Byte Pair Encoding · Weight Decay
