DICE: Discrete Interpretable Comparative Evaluation with Probabilistic Scoring for Retrieval-Augmented Generation
Shiyan Liu, Jian Ma, Rui Qu

TL;DR
DICE introduces an explainable, robust, and efficient evaluation framework for RAG systems using probabilistic scoring and a tournament approach, improving interpretability and reducing computational costs.
Contribution
The paper presents DICE, a novel evaluation method combining probabilistic scoring and a tournament strategy to enhance explainability and efficiency in RAG system assessment.
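To make "probabilistic scoring" concrete, here is a minimal sketch of one way a judge's probabilities over discrete verdicts could become a score plus a confidence value. The verdict labels, their numeric values, and the function name `score_comparison` are illustrative assumptions; the text above does not specify DICE's actual scoring rule.

```python
# Hypothetical discrete verdicts for comparing system A against system B.
# The labels and numeric values are assumptions made for illustration only.
VERDICT_VALUES = {"A_better": 1.0, "tie": 0.5, "B_better": 0.0}


def score_comparison(probs):
    """Map a distribution over discrete verdicts to (expected score for A, confidence).

    probs: dict from verdict label to a non-negative weight.
    The weights are normalized so they sum to 1 before scoring.
    """
    total = sum(probs.values())
    if total <= 0:
        raise ValueError("verdict weights must sum to a positive number")
    normalized = {label: weight / total for label, weight in probs.items()}
    # Expected score: probability-weighted average of the verdict values.
    expected = sum(VERDICT_VALUES[label] * p for label, p in normalized.items())
    # Confidence here is simply the probability of the most likely verdict.
    confidence = max(normalized.values())
    return expected, confidence


expected, confidence = score_comparison(
    {"A_better": 0.7, "tie": 0.2, "B_better": 0.1}
)
```

Keeping the full distribution, rather than only the top verdict, is what lets a judgment carry its own uncertainty into later aggregation.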
Findings
DICE achieves 85.7% agreement with human judgments.
It reduces the computational complexity of multi-system evaluation by 42.9%.
It outperforms existing metrics such as RAGAS.
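As a rough illustration of how a tournament can need fewer pairwise comparisons than an all-pairs evaluation, the sketch below counts both. The system count, the round count, and the rule that each round pairs every system once are assumptions chosen for illustration; the text above does not state the actual tournament configuration.

```python
def round_robin_comparisons(n_systems):
    """Number of pairwise comparisons when every system meets every other once."""
    return n_systems * (n_systems - 1) // 2


def tournament_comparisons(n_systems, n_rounds):
    """Comparisons in a tournament where each round pairs up all systems once.

    Assumes an even number of systems, so each round has n_systems // 2 matches.
    """
    return n_rounds * (n_systems // 2)


def reduction(n_systems, n_rounds):
    """Fraction of comparisons saved relative to a full round robin."""
    full = round_robin_comparisons(n_systems)
    return 1 - tournament_comparisons(n_systems, n_rounds) / full


# Hypothetical configuration: 8 systems, 4 rounds (16 comparisons instead of 28).
print(f"{reduction(8, 4):.1%}")  # prints 42.9%
```

Under these assumed numbers the saving is 12 of 28 comparisons. This is one configuration consistent with the reported 42.9% figure, not a confirmed description of the paper's experiment.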
Abstract
As Retrieval-Augmented Generation (RAG) systems evolve toward more sophisticated architectures, ensuring their trustworthiness through explainable and robust evaluation becomes critical. Existing scalar metrics suffer from limited interpretability, inadequate uncertainty quantification, and computational inefficiency in multi-system comparisons, hindering responsible deployment of RAG technologies. We introduce DICE (Discrete Interpretable Comparative Evaluation), a two-stage, evidence-coupled framework that advances explainability and robustness in RAG evaluation. DICE combines deep analytical reasoning with probabilistic scoring to produce transparent, confidence-aware judgments that support accountable system improvement through interpretable reasoning traces, enabling systematic error diagnosis and actionable insights. To address efficiency challenges at scale, DICE…
Taxonomy
Topics: Advanced Graph Neural Networks · Explainable Artificial Intelligence (XAI) · Machine Learning in Materials Science
