Leveraging RAG for Training-Free Alignment of LLMs
John T. Halloran

TL;DR
This paper introduces RAG-Pref, a training-free, retrieval-augmented alignment method for LLMs that improves refusal guardrails and human-preference alignment with minimal additional computational cost.
Contribution
The paper presents RAG-Pref, a simple online alignment algorithm that leverages contrastive information via retrieval, enhancing safety and preference alignment without retraining.
Findings
RAG-Pref improves refusals of agentic attacks by a factor of 3.7 on average across five LLMs.
RAG-Pref increases general human-preference alignment performance.
RAG-Pref does not significantly increase computational requirements.
Abstract
Large language model (LLM) alignment algorithms typically consist of post-training over preference pairs. While such algorithms are widely used to enable safety guardrails and align LLMs with general human preferences, we show that state-of-the-art alignment algorithms require significant computational resources while being far less capable of enabling refusal guardrails for recent agentic attacks. Thus, to improve refusal guardrails against such attacks without drastically increasing computational overhead, we introduce Retrieval Augmented Generation for Preference alignment (RAG-Pref), a simple RAG-based alignment algorithm which conditions on preferred and dispreferred samples to leverage contrastive information during inference. RAG-Pref is online (training-free), compatible with off-the-shelf packages, and, when combined with offline (training-based) alignment algorithms, enables…
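The abstract's central idea, retrieving preferred and dispreferred samples and conditioning on both at inference time, can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the bag-of-words retriever, the prompt wording, and the helper names `bow`, `cosine`, `retrieve`, and `build_prompt` are hypothetical and are not taken from the paper, which does not specify these details in the text above.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lowercased bag-of-words counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=1):
    """Return the k corpus entries most similar to the query."""
    q = bow(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, bow(doc)), reverse=True)
    return ranked[:k]


def build_prompt(query, preferred, dispreferred, k=1):
    """Assemble a prompt that shows both kinds of retrieved samples.

    Hypothetical template: the contrast between the two groups is what
    the model is meant to condition on; no weights are updated.
    """
    lines = ["Preferred samples:"]
    lines += [f"- {s}" for s in retrieve(query, preferred, k)]
    lines.append("Dispreferred samples:")
    lines += [f"- {s}" for s in retrieve(query, dispreferred, k)]
    lines.append(f"User request: {query}")
    return "\n".join(lines)


if __name__ == "__main__":
    preferred = [
        "I cannot delete system files for you.",
        "Here is a recipe for bread.",
    ]
    dispreferred = [
        "Sure, deleting all system files now.",
        "I refuse to share a bread recipe.",
    ]
    print(build_prompt("Please delete the system files", preferred, dispreferred))
```

The resulting string would then be passed to an unmodified LLM. In practice the toy retriever would likely be replaced by a dense embedding index from an off-the-shelf RAG package, which is consistent with the abstract's compatibility claim but is not shown here.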
