Optimization of Retrieval-Augmented Generation Context with Outlier Detection
Vitaly Bulgakov

TL;DR
This paper proposes methods that improve retrieval-augmented generation for question-answering systems: embedding-based features are used to detect outlier documents among the retrieved set, which reduces irrelevant context and improves response quality.
Contribution
It introduces novel outlier detection techniques based on embedding distances to improve the relevance of retrieved documents for LLM responses.
Findings
Outlier detection improves answer relevance, especially for complex questions.
Embedding-based features effectively identify irrelevant documents.
Enhanced retrieval quality reduces hallucinations in LLM outputs.
Abstract
In this paper, we focus on methods to reduce the size and improve the quality of the prompt context required for question-answering systems. Attempts to increase the number of retrieved chunked documents and thereby enlarge the context related to the query can significantly complicate the processing and decrease the performance of a Large Language Model (LLM) when generating responses to queries. It is well known that a large set of documents retrieved from a database in response to a query may contain irrelevant information, which often leads to hallucinations in the resulting answers. Our goal is to select the most semantically relevant documents, treating the discarded ones as outliers. We propose and evaluate several methods for identifying outliers by creating features that utilize the distances of embedding vectors, retrieved from the vector database, to both the centroid and the…
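The selection step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes each retrieved chunk already has an embedding vector, scores each chunk by its Euclidean distance to the centroid of the retrieved set plus its distance to the query embedding, and discards chunks whose score lies far above the mean. The second distance (to the query), the summed score, the z-score rule, and the names `select_inliers` and `z_max` are all assumptions made for illustration; the abstract is cut off before it names the second reference point or the detection method.

```python
import math
from statistics import mean, pstdev


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_inliers(doc_vecs, query_vec, z_max=1.5):
    """Return indices of retrieved documents that are kept as context.

    Hypothetical rule (an assumption, not the paper's method): a document
    is an outlier when its combined distance score is more than z_max
    population standard deviations above the mean score.
    """
    center = centroid(doc_vecs)
    # Feature per document: distance to centroid + distance to query.
    scores = [euclidean(v, center) + euclidean(v, query_vec) for v in doc_vecs]
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        # All documents are equally distant; nothing stands out.
        return list(range(len(doc_vecs)))
    return [i for i, s in enumerate(scores) if (s - mu) / sigma <= z_max]


if __name__ == "__main__":
    # Toy 2-D embeddings: four chunks near the query, one far away.
    docs = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.95, 0.05], [-1.0, -1.0]]
    query = [1.0, 0.0]
    print(select_inliers(docs, query))
```

In practice the vectors would come from the vector database rather than being typed by hand, and the threshold would need tuning per corpus, since a fixed z-score cutoff behaves differently for small retrieved sets than for large ones.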
Taxonomy
Topics: Target Tracking and Data Fusion in Sensor Networks · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization
Methods: Sparse Evolutionary Training · Focus
