TopClustRAG at SIGIR 2025 LiveRAG Challenge
Juli Bakagianni, John Pavlopoulos, Aristidis Likas

TL;DR
TopClustRAG is a hybrid retrieval-augmented generation system that uses clustering and multi-stage filtering to improve answer quality in large-scale web question answering tasks.
Contribution
It introduces a novel clustering-based approach for context filtering and prompt aggregation in RAG systems, enhancing answer relevance and faithfulness.
Findings
Ranked 2nd in faithfulness on the leaderboard
Ranked 7th in correctness on the leaderboard
Demonstrated effectiveness of clustering in large-scale RAG
Abstract
We present TopClustRAG, a retrieval-augmented generation (RAG) system developed for the LiveRAG Challenge, which evaluates end-to-end question answering over large-scale web corpora. Our system employs a hybrid retrieval strategy combining sparse and dense indices, followed by K-Means clustering to group semantically similar passages. Representative passages from each cluster are used to construct cluster-specific prompts for a large language model (LLM), generating intermediate answers that are filtered, reranked, and finally synthesized into a single, comprehensive response. This multi-stage pipeline enhances answer diversity, relevance, and faithfulness to retrieved evidence. Evaluated on the FineWeb Sample-10BT dataset, TopClustRAG ranked 2nd in faithfulness and 7th in correctness on the official leaderboard, demonstrating the effectiveness of clustering-based context filtering and…
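The clustering step described in the abstract can be illustrated with a minimal sketch. It assumes passage embeddings are already available (toy 2-D vectors stand in for them here) and uses a plain Lloyd's K-Means written with the standard library. The abstract does not say how representative passages are chosen, how many clusters are used, or how prompts are worded, so picking the passages nearest each centroid, the value of k, and the prompt template are all assumptions. The function names `kmeans`, `representatives`, and `build_prompt` are hypothetical and not taken from the paper.

```python
import math
import random


def assign(vectors, centroids):
    """Label each vector with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(v, centroids[c]))
        for v in vectors
    ]


def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's K-Means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        labels = assign(vectors, centroids)
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign(vectors, centroids)


def representatives(passages, vectors, k, per_cluster=1):
    """Assumed selection rule: the passages closest to each centroid."""
    centroids, labels = kmeans(vectors, k)
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, []).append(idx)
    result = {}
    for lab, idxs in groups.items():
        idxs.sort(key=lambda i: math.dist(vectors[i], centroids[lab]))
        result[lab] = [passages[i] for i in idxs[:per_cluster]]
    return result


def build_prompt(question, cluster_passages):
    """Hypothetical cluster-specific prompt template."""
    context = "\n".join(f"- {p}" for p in cluster_passages)
    return (
        "Answer using only the context.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


# Toy data: two well-separated groups of "passage embeddings".
passages = ["a", "b", "c", "d", "e", "f"]
vectors = [[0, 0], [0, 1], [0, 2], [10, 10], [10, 11], [10, 12]]

reps = representatives(passages, vectors, k=2)
prompts = [build_prompt("What is RAG?", ps) for ps in reps.values()]
```

Each prompt in `prompts` would then be sent to the LLM separately, and the intermediate answers filtered, reranked, and merged, as the abstract outlines. Those later stages are omitted here because the abstract gives no detail on them.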
