EvalSense: A Framework for Domain-Specific LLM (Meta-)Evaluation

Adam Dejl; Jonathan Pearson

arXiv:2602.18823·cs.CL·May 8, 2026


TL;DR

EvalSense is a flexible framework designed to improve the evaluation of large language models by providing domain-specific tools, an interactive guide, and automated reliability assessment, demonstrated through a clinical note generation case study.

Contribution

The paper introduces an extensible evaluation framework with tools for method selection and reliability assessment, addressing challenges in domain-specific LLM evaluation.

Findings

01 EvalSense effectively supports diverse evaluation strategies.

02 Automated meta-evaluation assesses reliability of evaluation approaches.

03 Case study demonstrates practical utility in clinical NLP tasks.
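
To make the meta-evaluation finding concrete, here is a minimal sketch of one common way to check an evaluation method's reliability: score outputs whose quality has been deliberately degraded by known amounts, then measure how well the scores track the degradation level. This is an illustration under stated assumptions, not EvalSense's implementation or API; the function names, the two metrics, and all numbers are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: 0 = original output, 4 = most heavily degraded output.
degradation_level = [0, 1, 2, 3, 4]
# Hypothetical scores from two candidate evaluation methods (higher = better).
metric_a_scores = [0.91, 0.84, 0.70, 0.55, 0.31]
metric_b_scores = [0.62, 0.66, 0.60, 0.65, 0.61]

rho_a = spearman(degradation_level, metric_a_scores)
rho_b = spearman(degradation_level, metric_b_scores)
print(f"metric A: rho = {rho_a:.2f}")
print(f"metric B: rho = {rho_b:.2f}")
```

Under this setup, a method whose scores fall steadily as degradation increases (a strongly negative correlation) is tracking quality, while a correlation near zero suggests the method is not sensitive to the degradation being tested.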

Abstract

Robust and comprehensive evaluation of large language models (LLMs) is essential for identifying effective LLM system configurations and mitigating risks associated with deploying LLMs in sensitive domains. However, traditional statistical metrics are poorly suited to open-ended generation tasks, leading to growing reliance on LLM-based evaluation methods. These methods, while often more flexible, introduce additional complexity: they depend on carefully chosen models, prompts, parameters, and evaluation strategies, making the evaluation process prone to misconfiguration and bias. In this work, we present EvalSense, a flexible, extensible framework for constructing domain-specific evaluation suites for LLMs. EvalSense provides out-of-the-box support for a broad range of model providers and evaluation strategies, and assists users in selecting and deploying suitable evaluation methods…

Code & Models

Repositories

nhsengland/evalsense (GitHub)

Videos

EvalSense: A Framework for Domain-Specific LLM (Meta-)Evaluation (Underline)