RAG-DIVE: A Dynamic Approach for Multi-Turn Dialogue Evaluation in Retrieval-Augmented Generation
Lorenz Brehme, Benedikt Dornauer, Jan-Henrik Böttcher, Klaus Schmid, Mircea-Cristian Racasan, Ruth Breu

TL;DR
RAG-DIVE introduces a dynamic, multi-turn dialogue evaluation framework for Retrieval-Augmented Generation systems, simulating user interactions with an LLM to better assess real-world performance.
Contribution
It presents a novel interactive evaluation method that dynamically generates and assesses multi-turn conversations, overcoming static dataset limitations.
Findings
RAG-DIVE effectively detects performance changes caused by system modifications.
It correlates well with traditional static evaluations in revealing performance trends.
The approach provides detailed per-turn and multi-turn metrics for comprehensive assessment.
Abstract
Evaluating Retrieval-Augmented Generation (RAG) systems using static multi-turn datasets fails to capture the dynamic nature of real-world dialogues. Existing evaluation methods rely on predefined datasets, which restrict them to static, one-directional queries and limit their ability to capture the adaptive, context-dependent performance of RAG systems in interactive, multi-turn settings. We therefore introduce RAG-DIVE, a Dynamic Interactive Validation and Evaluation approach that simulates user interactions with RAG systems. RAG-DIVE leverages an LLM to generate multi-turn conversations dynamically and is organized into three components. The dialogue generation stage consists of (1) the Conversation Generator, which simulates a user by creating multi-turn queries, and (2) the Conversation Validator, which filters and corrects invalid or low-quality outputs to ensure coherent…
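The dialogue generation stage described in the abstract can be pictured as a generate-validate-answer loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: `run_dialogue`, `toy_generator`, `toy_validator`, `toy_rag`, and the `max_turns` and `max_retries` parameters are all hypothetical names introduced for illustration, and the LLM calls are replaced by deterministic stubs so the example runs on its own. The abstract is truncated before it describes the third component, so that component is left out here.

```python
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (user query, system answer)


def run_dialogue(
    generate_query: Callable[[List[Turn]], str],
    validate_query: Callable[[str, List[Turn]], Optional[str]],
    rag_answer: Callable[[str, List[Turn]], str],
    max_turns: int = 3,
    max_retries: int = 2,
) -> List[Turn]:
    """Build a multi-turn dialogue one turn at a time (hypothetical sketch)."""
    history: List[Turn] = []
    for _ in range(max_turns):
        query: Optional[str] = None
        # Ask the simulated user for a query; retry if the validator rejects it.
        for _ in range(max_retries + 1):
            candidate = generate_query(history)
            query = validate_query(candidate, history)
            if query is not None:
                break
        if query is None:
            break  # no valid query could be produced; end the dialogue early
        # The system under test answers with the conversation so far as context.
        history.append((query, rag_answer(query, history)))
    return history


# Deterministic stand-ins for the LLM-backed components (assumptions).
def toy_generator(history: List[Turn]) -> str:
    return f"  question {len(history) + 1}  "


def toy_validator(candidate: str, history: List[Turn]) -> Optional[str]:
    cleaned = candidate.strip()  # "correct" the output by trimming whitespace
    already_asked = {query for query, _ in history}
    if not cleaned or cleaned in already_asked:
        return None  # "filter" empty or repeated queries
    return cleaned


def toy_rag(query: str, history: List[Turn]) -> str:
    return f"answer to: {query}"


dialogue = run_dialogue(toy_generator, toy_validator, toy_rag, max_turns=3)
for query, answer in dialogue:
    print(query, "->", answer)
```

Because each query is generated from the history of previous answers, the conversation depends on what the system under test actually returned, which is the property a static dataset cannot provide.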
