RankArena: A Unified Platform for Evaluating Retrieval, Reranking and RAG with Human and LLM Feedback
Abdelrahman Abdallah, Mahmoud Abdalla, Bhawna Piryani, Jamshid Mozafari, Mohammed Ali, Adam Jatowt

TL;DR
RankArena is a comprehensive platform that enables multi-faceted evaluation of retrieval, reranking, and RAG systems using human and LLM feedback, facilitating better analysis and training of retrieval models.
Contribution
It introduces a unified, scalable platform supporting diverse evaluation modes and feedback collection for retrieval and RAG systems, integrating human and LLM judgments.
Findings
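Supports four evaluation modes: direct reranking visualisation, blind pairwise comparison with human or LLM voting, supervised manual document annotation, and end-to-end RAG answer quality assessment.
Captures fine-grained relevance feedback through pairwise preferences and full-list annotations, with auxiliary metadata such as movement metrics, annotation time, and quality ratings.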
Enables comparison between model rankings and human annotations.
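As an illustration of the last finding, the sketch below compares a model ranking with a human-annotated ordering of the same documents. It is not RankArena's implementation: the function names, the use of Kendall's tau, and the "mean rank displacement" definition of a movement metric are all assumptions made here for the example.

```python
from itertools import combinations


def kendall_tau(model_ranking, human_ranking):
    """Kendall's tau-a between two orderings of the same document IDs.

    Returns 1.0 for identical orderings and -1.0 for fully reversed ones.
    (Assumed agreement measure; the platform may use a different one.)
    """
    if sorted(model_ranking) != sorted(human_ranking):
        raise ValueError("both rankings must contain the same document IDs")
    n = len(model_ranking)
    if n < 2 or len(set(model_ranking)) != n:
        raise ValueError("need at least two distinct document IDs")

    human_pos = {doc: i for i, doc in enumerate(human_ranking)}
    concordant = discordant = 0
    # Each pair (a, b) is taken in model order, so a precedes b for the model.
    for a, b in combinations(model_ranking, 2):
        if human_pos[a] < human_pos[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def mean_rank_displacement(before, after):
    """Average number of positions each document moved between two orderings.

    Hypothetical stand-in for a "movement metric"; the source does not define one.
    """
    if sorted(before) != sorted(after):
        raise ValueError("both orderings must contain the same document IDs")
    after_pos = {doc: i for i, doc in enumerate(after)}
    return sum(abs(i - after_pos[doc]) for i, doc in enumerate(before)) / len(before)


# Hypothetical document IDs: the annotator swaps the model's top two results.
model = ["d1", "d2", "d3", "d4"]
human = ["d2", "d1", "d3", "d4"]
tau = kendall_tau(model, human)
moved = mean_rank_displacement(model, human)
```

Here one of the six document pairs is ordered differently, so `tau` is (5 - 1) / 6, and two of the four documents move by one position each, so `moved` is 0.5.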
Abstract
Evaluating the quality of retrieval-augmented generation (RAG) and document reranking systems remains challenging due to the lack of scalable, user-centric, and multi-perspective evaluation tools. We introduce RankArena, a unified platform for comparing and analysing the performance of retrieval pipelines, rerankers, and RAG systems using structured human and LLM-based feedback, and for collecting that feedback. RankArena supports multiple evaluation modes: direct reranking visualisation, blind pairwise comparisons with human or LLM voting, supervised manual document annotation, and end-to-end RAG answer quality assessment. It captures fine-grained relevance feedback through both pairwise preferences and full-list annotations, along with auxiliary metadata such as movement metrics, annotation time, and quality ratings. The platform also integrates LLM-as-a-judge evaluation,…
