RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG

Joshua Gao; Quoc Huy Pham; Subin Varghese; Silwal Saurav; Vedhus Hoskere

arXiv:2511.04502 · cs.CL · November 7, 2025

TL;DR

RAGalyst is an automated framework for evaluating domain-specific RAG systems. It aligns LLM-based evaluation with human judgment, addressing the difficulty of assessing factual accuracy in specialized, safety-critical fields.

Contribution

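RAGalyst introduces an agentic pipeline that generates synthetic QA datasets from source documents and filters them for fidelity. It also refines two LLM-as-a-Judge metrics, Answer Correctness and Answerability, so that they better match human evaluations in specialized domains.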
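
The generate-then-filter idea behind the dataset pipeline can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `QAPair`, `build_dataset`, `toy_generate`, and `toy_answerability`, as well as the 0.5 threshold, are assumptions made here for the example. In a real system the two toy functions would be replaced by LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class QAPair:
    """One synthetic question-answer pair plus the chunk it came from."""
    question: str
    answer: str
    context: str


def build_dataset(
    chunks: Iterable[str],
    generate: Callable[[str], Optional[QAPair]],
    answerability: Callable[[QAPair], float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[QAPair]:
    """Generate one QA pair per chunk and keep only pairs the judge accepts."""
    kept: List[QAPair] = []
    for chunk in chunks:
        pair = generate(chunk)
        # Drop chunks that yield no pair, and pairs scored below the threshold.
        if pair is not None and answerability(pair) >= threshold:
            kept.append(pair)
    return kept


def toy_generate(chunk: str) -> Optional[QAPair]:
    """Stand-in for an LLM generator: uses the first sentence as the answer."""
    text = chunk.strip()
    if not text:
        return None
    first_sentence = text.split(".")[0].strip()
    return QAPair(
        question="What does the passage state?",
        answer=first_sentence,
        context=text,
    )


def toy_answerability(pair: QAPair) -> float:
    """Stand-in for an LLM judge: 1.0 if the answer appears in the context."""
    return 1.0 if pair.answer and pair.answer in pair.context else 0.0
```

Calling `build_dataset(chunks, toy_generate, toy_answerability)` returns only the pairs whose answer is supported by their source chunk. Passing the generator and judge as functions keeps the filtering logic independent of any particular model or prompt.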

Findings

01. Performance varies significantly across domains and configurations.

02. No single model or hyperparameter setting is universally optimal.

03. Analysis reveals common reasons for low Answer Correctness.

Abstract

Retrieval-Augmented Generation (RAG) is a critical technique for grounding Large Language Models (LLMs) in factual evidence, yet evaluating RAG systems in specialized, safety-critical domains remains a significant challenge. Existing evaluation frameworks often rely on heuristic-based metrics that fail to capture domain-specific nuances, while other works use LLM-as-a-Judge approaches that lack validated alignment with human judgment. This paper introduces RAGalyst, an automated, human-aligned agentic framework designed for the rigorous evaluation of domain-specific RAG systems. RAGalyst features an agentic pipeline that generates high-quality, synthetic question-answering (QA) datasets from source documents, incorporating an agentic filtering step to ensure data fidelity. The framework refines two key LLM-as-a-Judge metrics, Answer Correctness and Answerability, using prompt…

Code & Models

Datasets

hoskerelab/ragalyst-qac
dataset · 75 dl

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Expert finding and Q&A systems