Evaluating Hybrid Retrieval Augmented Generation using Dynamic Test Sets: LiveRAG Challenge

Chase Fensore; Kaustubh Dhole; Joyce C Ho; Eugene Agichtein

arXiv:2506.22644 · cs.CL · July 1, 2025

Open Access

TL;DR

This paper evaluates a hybrid retrieval-augmented generation system that combines sparse and dense retrieval on dynamic test sets, and analyzes how re-ranking, prompting strategies, and vocabulary alignment affect performance.

Contribution

It introduces a hybrid retrieval approach for RAG systems, assesses re-ranking and prompting strategies, and identifies vocabulary alignment as a key performance predictor.

Findings

01

Neural re-ranking with RankLLaMA raises MAP from 0.523 to 0.797 but is computationally expensive (84s vs 1.74s per question).

02

DSPy-optimized prompting raises semantic similarity (0.771 vs 0.668), but its 0% refusal rate raises over-confidence concerns.

03

Vocabulary alignment correlates strongly with system performance.
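
For readers unfamiliar with the MAP metric cited in the first finding, the following is a minimal sketch of the textbook definition of mean average precision. The function names, the binary relevance sets, and the absence of a rank cutoff are assumptions made for illustration; the listing does not describe how the authors computed their scores.

```python
from typing import Dict, Sequence, Set


def average_precision(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Average precision for one query.

    Averages precision@k over every rank k at which a relevant document
    appears, normalized by the total number of relevant documents.
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def mean_average_precision(
    runs: Dict[str, Sequence[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of per-query average precision over all queries in `runs`."""
    if not runs:
        return 0.0
    total = sum(
        average_precision(ranking, qrels.get(query_id, set()))
        for query_id, ranking in runs.items()
    )
    return total / len(runs)
```

Here `runs` maps a hypothetical query id to its ranked document ids and `qrels` maps the same id to the set of documents judged relevant. A re-ranker improves MAP when it moves relevant documents toward the top of each ranking.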

Abstract

We present our submission to the LiveRAG Challenge 2025, which evaluates retrieval-augmented generation (RAG) systems on dynamic test sets using the FineWeb-10BT corpus. Our final hybrid approach combines sparse (BM25) and dense (E5) retrieval methods and then aims to generate relevant and faithful answers with Falcon3-10B-Instruct. Through systematic evaluation on 200 synthetic questions generated with DataMorgana across 64 unique question-user combinations, we demonstrate that neural re-ranking with RankLLaMA improves MAP from 0.523 to 0.797 (52% relative improvement) but introduces prohibitive computational costs (84s vs 1.74s per question). While DSPy-optimized prompting strategies achieved higher semantic similarity (0.771 vs 0.668), their 0% refusal rates raised concerns about over-confidence and generalizability. Our submitted hybrid system without re-ranking achieved 4th place…
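
The abstract does not say how the sparse (BM25) and dense (E5) rankings are merged. Purely as an illustration, the sketch below uses reciprocal rank fusion (RRF), a widely used rank-merging technique. RRF is an assumption here, not the authors' confirmed method, and the function name, the constant `k = 60`, and the document ids are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[str]:
    """Merge several ranked lists of document ids into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document id so the output
    is deterministic. This is an assumed fusion rule, not the paper's.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Calling `reciprocal_rank_fusion([bm25_ids, dense_ids])` with two ranked id lists returns a single ordering in which documents ranked well by both retrievers rise to the top. Because RRF uses only ranks, it needs no calibration between BM25 scores and embedding similarities, which is one reason it is a common default for hybrid retrieval.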

Taxonomy

Topics: Topic Modeling · Expert finding and Q&A systems · Multimodal Machine Learning Applications