A Methodology for Evaluating RAG Systems: A Case Study On Configuration Dependency Validation
Sebastian Simon, Alina Mailach, Johannes Dorn, and Norbert Siegmund

TL;DR
This paper introduces a systematic methodology for evaluating retrieval-augmented generation (RAG) systems, demonstrated through a case study on dependency validation in software engineering, leading to a highly accurate RAG system.
Contribution
The paper presents a novel, reusable evaluation methodology for RAG systems and, following this methodology, develops a RAG system that achieves top accuracy in configuration dependency validation.
Findings
Selecting proper baselines and metrics is crucial.
Systematic refinement improves RAG performance.
Reporting design decisions enhances reproducibility.
Abstract
Retrieval-augmented generation (RAG) is an umbrella of different components, design decisions, and domain-specific adaptations to enhance the capabilities of large language models and counter their limitations regarding hallucination and outdated and missing knowledge. Since it is unclear which design decisions lead to a satisfactory performance, developing RAG systems is often experimental and needs to follow a systematic and sound methodology to gain sound and reliable results. However, there is currently no generally accepted methodology for RAG evaluation despite a growing interest in this technology. In this paper, we propose a first blueprint of a methodology for a sound and reliable evaluation of RAG systems and demonstrate its applicability on a real-world software engineering research task: the validation of configuration dependencies across software technologies. In summary,…
Taxonomy
Topics: Embedded Systems Design Techniques · Advanced Software Engineering Methodologies · Software System Performance and Reliability
Methods: Attention Dropout · Linear Layer · Weight Decay · WordPiece · Linear Warmup With Linear Decay · Dropout · Layer Normalization · Byte Pair Encoding · BERT
