Reliable, Reproducible, and Really Fast Leaderboards with Evalica
Dmitry Ustalov

TL;DR
Evalica is an open-source toolkit designed to create reliable, reproducible, and fast leaderboards for NLP models, supporting various interfaces to enhance evaluation protocols with human and machine feedback.
Contribution
The paper introduces Evalica, an open-source toolkit that streamlines building reliable and reproducible model leaderboards for NLP research.
Findings
Evalica facilitates the creation of reliable and reproducible model leaderboards.
The toolkit offers three interfaces: a Web interface, a command-line interface (CLI), and a Python API.
The paper evaluates Evalica's performance and demonstrates its usability through each of these interfaces.
Abstract
The rapid advancement of natural language processing (NLP) technologies, such as instruction-tuned large language models (LLMs), calls for the development of modern evaluation protocols with human and machine feedback. We introduce Evalica, an open-source toolkit that facilitates the creation of reliable and reproducible model leaderboards. This paper presents its design, evaluates its performance, and demonstrates its usability through its Web interface, command-line interface, and Python API.
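To make the idea of a leaderboard built from human or machine feedback concrete, here is a minimal sketch in plain Python. It assumes the feedback arrives as pairwise comparisons (left model, right model, outcome) and ranks models with a simple Elo-style update. This is an illustration only: it does not use Evalica, and the function name `elo_leaderboard`, its parameters, and the sample data are all hypothetical rather than taken from the paper.

```python
from collections import defaultdict


def elo_leaderboard(comparisons, k=4.0, initial=1000.0, scale=400.0):
    """Rank models from pairwise comparisons with an Elo-style update.

    comparisons: iterable of (left, right, outcome) tuples, where outcome
    is "left", "right", or "tie". All names here are illustrative.
    """
    ratings = defaultdict(lambda: initial)
    # Score earned by the left model for each possible outcome.
    score_for_left = {"left": 1.0, "right": 0.0, "tie": 0.5}

    for left, right, outcome in comparisons:
        # Probability that the left model wins under the current ratings.
        expected_left = 1.0 / (1.0 + 10 ** ((ratings[right] - ratings[left]) / scale))
        delta = k * (score_for_left[outcome] - expected_left)
        # Zero-sum update: what one side gains, the other loses.
        ratings[left] += delta
        ratings[right] -= delta

    # Highest rating first.
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical judgments: each tuple is one human or machine preference.
    judgments = [
        ("model-a", "model-b", "left"),
        ("model-a", "model-c", "left"),
        ("model-b", "model-c", "left"),
        ("model-c", "model-a", "tie"),
    ]
    for name, rating in elo_leaderboard(judgments):
        print(f"{name}\t{rating:.2f}")
```

Note that an Elo-style update depends on the order in which comparisons are processed, so reproducing a ranking requires fixing that order along with the parameters `k`, `initial`, and `scale`.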
Taxonomy
Topics: Music Technology and Sound Studies · Artificial Intelligence in Games
