Highlight & Summarize: RAG without the jailbreaks
Giovanni Cherubin, Andrew Paverd

TL;DR
Highlight & Summarize (H&S) is a novel retrieval-augmented generation approach that enhances security against jailbreaks by never revealing user questions to the LLM, thus preventing malicious prompt injections.
Contribution
This paper introduces H&S, a new RAG design pattern that prevents jailbreaks by separating question highlighting and answer summarization, avoiding direct question exposure to the LLM.
Findings
H&S achieves comparable or better answer quality than standard RAG.
H&S effectively prevents jailbreak attacks by design.
Evaluations show H&S maintains high relevance and correctness.
Abstract
Preventing jailbreaking and model hijacking of Large Language Models (LLMs) is an important yet challenging task. When interacting with a chatbot, malicious users can input specially crafted prompts that cause the LLM to generate undesirable content or perform a different task from its intended purpose. Existing systems attempt to mitigate this by hardening the LLM's system prompt or using additional classifiers to detect undesirable content or off-topic conversations. However, these probabilistic approaches are relatively easy to bypass due to the very large space of possible inputs and undesirable outputs. We present and evaluate Highlight & Summarize (H&S), a new design pattern for retrieval-augmented generation (RAG) systems that prevents these attacks by design. The core idea is to perform the same task as a standard RAG pipeline (i.e., to provide natural language answers to…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
