RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG
Joshua Gao, Quoc Huy Pham, Subin Varghese, Silwal Saurav, Vedhus Hoskere

TL;DR
RAGalyst is an automated framework that aligns LLM-based evaluation with human judgment to assess domain-specific RAG systems, addressing the challenge of evaluating factual accuracy in safety-critical fields.
Contribution
It introduces a novel agentic pipeline for generating synthetic QA datasets and refines LLM-as-a-Judge metrics to better match human evaluations in specialized domains.
Findings
Performance varies significantly across domains and configurations.
No single model or hyperparameter setting is universally optimal.
Analysis reveals common reasons for low Answer Correctness.
Abstract
Retrieval-Augmented Generation (RAG) is a critical technique for grounding Large Language Models (LLMs) in factual evidence, yet evaluating RAG systems in specialized, safety-critical domains remains a significant challenge. Existing evaluation frameworks often rely on heuristic-based metrics that fail to capture domain-specific nuances, while other works use LLM-as-a-Judge approaches that lack validated alignment with human judgment. This paper introduces RAGalyst, an automated, human-aligned agentic framework designed for the rigorous evaluation of domain-specific RAG systems. RAGalyst features an agentic pipeline that generates high-quality, synthetic question-answering (QA) datasets from source documents, incorporating an agentic filtering step to ensure data fidelity. The framework refines two key LLM-as-a-Judge metrics, Answer Correctness and Answerability, using prompt…
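The generate-then-filter idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not RAGalyst's actual implementation: the `QAPair` type, the `build_dataset` function, the `0.8` threshold, and the toy stand-ins for the LLM generator and judge are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class QAPair:
    context: str   # source passage the question was drawn from
    question: str
    answer: str


def build_dataset(
    chunks: Iterable[str],
    generate: Callable[[str], QAPair],
    answerability: Callable[[QAPair], float],
    threshold: float = 0.8,  # hypothetical cutoff, not taken from the paper
) -> List[QAPair]:
    """Generate one QA pair per chunk; keep pairs scoring at or above threshold."""
    kept: List[QAPair] = []
    for chunk in chunks:
        pair = generate(chunk)
        if answerability(pair) >= threshold:
            kept.append(pair)
    return kept


# Toy stand-ins for the LLM calls (assumptions for illustration only).
def toy_generate(chunk: str) -> QAPair:
    # Use the first sentence of the chunk as the "answer".
    first_sentence = chunk.split(".")[0].strip()
    return QAPair(context=chunk,
                  question="What does the passage state?",
                  answer=first_sentence)


def toy_answerability(pair: QAPair) -> float:
    # Score 1.0 only if a non-empty answer appears verbatim in the context.
    return 1.0 if pair.answer and pair.answer in pair.context else 0.0


chunks = ["Steel yields at high stress. More text follows.", "   "]
dataset = build_dataset(chunks, toy_generate, toy_answerability)
```

In a real setup, `generate` and `answerability` would wrap LLM calls; passing them in as callables keeps the filtering logic independent of any particular model or prompt.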
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Expert finding and Q&A systems
