Statistical multi-metric evaluation and visualization of LLM system predictive performance

Samuel Ackerman; Eitan Farchi; Orna Raz; Assaf Toledo

arXiv:2501.18243 · stat.AP · January 31, 2025

TL;DR

This paper introduces a statistical evaluation and visualization framework for multi-metric assessment of LLM-based systems, supporting significance-tested comparison of system configurations across metrics and datasets.

Contribution

The authors develop an automated framework that performs appropriate statistical tests, aggregates results across metrics and datasets, and visualizes performance differences.
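When several metrics are tested at once, the per-metric p-values usually need some adjustment for multiple comparisons before they are read together. The sketch below shows one standard option, the Holm step-down adjustment. It is an illustration only: this summary does not state which correction or aggregation the framework uses, and the function name `holm_adjust`, the metric names, and the p-values are all hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the same order as the input.

    Assumption: Holm's step-down procedure is used here purely as an
    example of multiple-comparison control; it is not claimed to be
    the paper's method.
    """
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is scaled by (m - rank).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three metrics on one dataset.
raw = {"exact_match": 0.01, "edit_similarity": 0.04, "identifier_f1": 0.03}
adjusted = dict(zip(raw, holm_adjust(list(raw.values()))))
for metric, p in adjusted.items():
    print(f"{metric}: adjusted p = {p:.3f}")
```

A difference on a metric would then be called significant only if its adjusted p-value falls below the chosen threshold, which keeps the family-wise error rate under control across the metrics tested.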

Findings

01. Framework successfully applied to the CrossCodeEval benchmark
02. Enables significance testing of performance differences
03. Supports decision-making in LLM system improvements

Abstract

The evaluation of generative or discriminative large language model (LLM)-based systems is often a complex multi-dimensional problem. Typically, a set of system configuration alternatives is evaluated on one or more benchmark datasets, each with one or more evaluation metrics, which may differ between datasets. We often want to evaluate -- with a statistical measure of significance -- whether systems perform differently on a given dataset according to a single metric, on aggregate across metrics on a dataset, or across datasets. Such evaluations can support decision-making, such as deciding whether a particular system component change (e.g., choice of LLM or hyperparameter values) significantly improves performance over the current system configuration, or, more generally, whether a fixed set of system configurations (e.g., a leaderboard list) have significantly…
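The simplest case described above, two system configurations compared on one dataset under one metric, can be sketched as a paired test on per-example scores. The example below uses a two-sided sign-flip permutation test, chosen here only because it needs no distributional assumptions and no third-party libraries; the abstract does not say which tests the authors apply. The function name `paired_permutation_test` and the scores are hypothetical.

```python
import random
from statistics import mean


def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-example scores.

    Assumption: both systems were scored on the same examples in the
    same order, so element i of each list refers to the same example.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired (same length)")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference
        # is arbitrary, so flip each one with probability 1/2.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_resamples + 1)


# Hypothetical per-example metric scores for two configurations.
baseline = [0.52, 0.61, 0.47, 0.58, 0.66, 0.49, 0.55, 0.60, 0.51, 0.63]
candidate = [0.58, 0.65, 0.50, 0.64, 0.70, 0.55, 0.57, 0.66, 0.59, 0.68]
p_value = paired_permutation_test(candidate, baseline, n_resamples=2_000)
print(f"p-value: {p_value:.4f}")
```

Pairing matters here: because both configurations are scored on the same examples, testing the per-example differences removes example-level difficulty as a source of noise, which an unpaired comparison of the two means would not do.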
