TL;DR
EvalSense is a flexible framework designed to improve the evaluation of large language models by providing domain-specific tools, an interactive guide, and automated reliability assessment, demonstrated through a clinical note generation case study.
Contribution
It introduces a comprehensive, extensible evaluation framework with tools for method selection and reliability assessment, addressing challenges in domain-specific LLM evaluation.
Findings
EvalSense supports a broad range of model providers and evaluation strategies out of the box.
Automated meta-evaluation assesses the reliability of different evaluation approaches.
A case study on clinical note generation demonstrates the framework's practical utility.
Abstract
Robust and comprehensive evaluation of large language models (LLMs) is essential for identifying effective LLM system configurations and mitigating risks associated with deploying LLMs in sensitive domains. However, traditional statistical metrics are poorly suited to open-ended generation tasks, leading to growing reliance on LLM-based evaluation methods. These methods, while often more flexible, introduce additional complexity: they depend on carefully chosen models, prompts, parameters, and evaluation strategies, making the evaluation process prone to misconfiguration and bias. In this work, we present EvalSense, a flexible, extensible framework for constructing domain-specific evaluation suites for LLMs. EvalSense provides out-of-the-box support for a broad range of model providers and evaluation strategies, and assists users in selecting and deploying suitable evaluation methods…
