Evaluating Large Language Models with fmeval
Pola Schwöbel, Luca Franceschi, Muhammad Bilal Zafar, Keerthan Vasist, Aman Malhotra, Tomer Shenhar, Pinal Tailor, Pinar Yilmaz, Michael Diamond, Michele Donini

TL;DR
fmeval is an open source library designed to evaluate large language models across various tasks and responsible AI dimensions, emphasizing simplicity, coverage, extensibility, and performance.
Contribution
This paper introduces fmeval, an evaluation library for LLMs, details its design principles and implementation, and demonstrates it on a practical use case.
Findings
Enables evaluation of LLMs across multiple tasks
Facilitates responsible AI assessment
Supports model selection for specific applications
Abstract
fmeval is an open source library to evaluate large language models (LLMs) in a range of tasks. It helps practitioners evaluate their model for task performance and along multiple responsible AI dimensions. This paper presents the library and exposes its underlying design principles: simplicity, coverage, extensibility and performance. We then present how these were implemented in the scientific and engineering choices taken when developing fmeval. A case study demonstrates a typical use case for the library: picking a suitable model for a question answering task. We close by discussing limitations and further work in the development of the library. fmeval can be found at https://github.com/aws/fmeval.
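The model-selection use case mentioned in the abstract can be illustrated with a minimal, library-independent sketch. It does not use fmeval's API; it only shows the general idea of scoring candidate models on a question answering dataset with two common metrics (exact match and token-level F1) and picking the best one. The dataset, the stand-in models `model_a` and `model_b`, and all function names are hypothetical and chosen for illustration only.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation, and split into whitespace tokens."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def exact_match(prediction, target):
    """1.0 if the normalized prediction equals the normalized target, else 0.0."""
    return float(normalize(prediction) == normalize(target))


def token_f1(prediction, target):
    """Harmonic mean of token-level precision and recall."""
    pred, gold = normalize(prediction), normalize(target)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def score_model(model, dataset):
    """Average exact match and F1 of `model` over (question, answer) pairs."""
    em_total = f1_total = 0.0
    for question, answer in dataset:
        output = model(question)
        em_total += exact_match(output, answer)
        f1_total += token_f1(output, answer)
    n = len(dataset)
    return {"exact_match": em_total / n, "f1": f1_total / n}


# Hypothetical toy dataset; a real evaluation would use a proper QA benchmark.
dataset = [
    ("What is the capital of France?", "Paris"),
    ("Who wrote Hamlet?", "William Shakespeare"),
]


# Hypothetical stand-ins for calls to two candidate LLMs.
def model_a(question):
    return {"What is the capital of France?": "Paris.",
            "Who wrote Hamlet?": "Shakespeare"}[question]


def model_b(question):
    return {"What is the capital of France?": "The capital is Paris",
            "Who wrote Hamlet?": "William Shakespeare"}[question]


scores = {name: score_model(fn, dataset)
          for name, fn in [("model_a", model_a), ("model_b", model_b)]}
best = max(scores, key=lambda name: scores[name]["f1"])
print(scores, "best by F1:", best)
```

In this toy setup both models tie on exact match, so the choice falls to F1, which rewards concise answers that overlap with the reference. In practice the selection criterion would also weigh the responsible AI dimensions the abstract mentions, not task accuracy alone.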
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Lib
