Reliable, Reproducible, and Really Fast Leaderboards with Evalica

Dmitry Ustalov

arXiv:2412.11314·cs.CL·December 17, 2024

TL;DR

Evalica is an open-source toolkit designed to create reliable, reproducible, and fast leaderboards for NLP models, supporting various interfaces to enhance evaluation protocols with human and machine feedback.

Contribution

The paper introduces Evalica, a toolkit that streamlines building trustworthy and efficient model leaderboards for NLP research.

Findings

01

Evalica achieves high reliability and reproducibility in leaderboard creation.

02

The toolkit is user-friendly with multiple interfaces including Web, CLI, and Python API.

03

Evaluation shows Evalica's performance and usability in real-world NLP model assessments.

Abstract

The rapid advancement of natural language processing (NLP) technologies, such as instruction-tuned large language models (LLMs), urges the development of modern evaluation protocols with human and machine feedback. We introduce Evalica, an open-source toolkit that facilitates the creation of reliable and reproducible model leaderboards. This paper presents its design, evaluates its performance, and demonstrates its usability through its Web interface, command-line interface, and Python API.
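To make the idea of a leaderboard built from human or machine feedback concrete, here is a minimal sketch that ranks models from pairwise preference votes with an Elo-style update. It is an illustration only and does not use Evalica's API: the function name `elo_leaderboard`, the vote format, the model names, and the parameter defaults (`initial`, `k`, `scale`) are all assumptions made for this example.

```python
from collections import defaultdict


def elo_leaderboard(comparisons, initial=1000.0, k=32.0, scale=400.0):
    """Rank items from pairwise votes with a sequential Elo update.

    comparisons: iterable of (left, right, winner) tuples, where winner is
    "left", "right", or "draw". All names here are hypothetical.
    """
    ratings = defaultdict(lambda: initial)
    outcome = {"left": 1.0, "right": 0.0, "draw": 0.5}

    for left, right, winner in comparisons:
        r_left, r_right = ratings[left], ratings[right]
        # Expected score of the left item under the logistic Elo model.
        expected_left = 1.0 / (1.0 + 10.0 ** ((r_right - r_left) / scale))
        # Zero-sum update: whatever the left item gains, the right item loses.
        delta = k * (outcome[winner] - expected_left)
        ratings[left] = r_left + delta
        ratings[right] = r_right - delta

    # Highest rating first.
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical votes, e.g. collected from annotators or an LLM judge.
votes = [
    ("model-a", "model-b", "left"),
    ("model-b", "model-c", "left"),
    ("model-a", "model-c", "left"),
    ("model-b", "model-a", "draw"),
]

leaderboard = elo_leaderboard(votes)
for rank, (model, rating) in enumerate(leaderboard, start=1):
    print(f"{rank}. {model}: {rating:.1f}")
```

Sequential Elo depends on the order of the votes, which is one reason reproducible leaderboards benefit from a shared, tested implementation rather than ad hoc scripts like this one.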

Code & Models

Repositories

dustalov/evalica
Official

Datasets

dustalov/llmfao
dataset · 59 downloads

Taxonomy

Topics: Music Technology and Sound Studies · Artificial Intelligence in Games