Improving LLM Leaderboards with Psychometrical Methodology

Denis Federiakin

arXiv:2501.17200·cs.CL·January 30, 2025

TL;DR

This paper proposes applying psychometric methodologies, traditionally used in human assessments, to improve the evaluation and ranking of large language models on leaderboards, leading to more robust and meaningful comparisons.

Contribution

It introduces psychometric techniques into LLM benchmarking, replacing simplistic aggregation methods with more rigorous evaluation approaches.

Findings

01. Psychometric methods improve LLM ranking robustness.

02. Psychometric evaluation provides more meaningful performance insights.

03. Comparison shows advantages over naive averaging methods.
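
To make the concern about naive averaging concrete, the sketch below ranks three models two ways: by the plain mean of their benchmark scores, and by the mean after each benchmark is standardized to z-scores. Everything in it is an assumption made for illustration: the model names, benchmark names, and scores are invented, and z-score standardization is only a simple stand-in, not the psychometric modeling the paper applies. It shows how one benchmark with a wide score spread can dominate a plain average.

```python
from statistics import mean, pstdev

# Hypothetical scores (fraction correct); models and benchmarks are made up.
scores = {
    "model_a": {"bench_1": 0.90, "bench_2": 0.50, "bench_3": 0.52},
    "model_b": {"bench_1": 0.70, "bench_2": 0.56, "bench_3": 0.58},
    "model_c": {"bench_1": 0.80, "bench_2": 0.53, "bench_3": 0.55},
}


def rank_by_mean(scores):
    """Rank models by the plain average of their benchmark scores."""
    means = {model: mean(s.values()) for model, s in scores.items()}
    return sorted(means, key=means.get, reverse=True)


def rank_by_standardized_mean(scores):
    """Rank models by the average of per-benchmark z-scores.

    Standardizing puts every benchmark on the same scale, so a benchmark
    with a wide score spread no longer outweighs the others.
    """
    benchmarks = list(next(iter(scores.values())))
    z_scores = {model: [] for model in scores}
    for bench in benchmarks:
        column = [scores[model][bench] for model in scores]
        mu, sd = mean(column), pstdev(column)
        for model in scores:
            # A benchmark on which all models tie carries no ranking information.
            z = (scores[model][bench] - mu) / sd if sd > 0 else 0.0
            z_scores[model].append(z)
    z_means = {model: mean(z) for model, z in z_scores.items()}
    return sorted(z_means, key=z_means.get, reverse=True)


print(rank_by_mean(scores))               # ['model_a', 'model_c', 'model_b']
print(rank_by_standardized_mean(scores))  # ['model_b', 'model_c', 'model_a']
```

In this toy data, model_a leads the plain average only because bench_1 has the widest spread; once the benchmarks share a scale, model_b, which is ahead on two of the three benchmarks, moves to the top.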

Abstract

The rapid development of large language models (LLMs) has necessitated the creation of benchmarks to evaluate their performance. These benchmarks resemble human tests and surveys, as they consist of sets of questions designed to measure emergent properties in the cognitive behavior of these systems. However, unlike the well-defined traits and abilities studied in social sciences, the properties measured by these benchmarks are often vaguer and less rigorously defined. The most prominent benchmarks are often grouped into leaderboards for convenience, aggregating performance metrics and enabling comparisons between models. Unfortunately, these leaderboards typically rely on simplistic aggregation methods, such as taking the average score across benchmarks. In this paper, we demonstrate the advantages of applying contemporary psychometric methodologies - originally developed for human…

Taxonomy

Topics: ERP Systems Implementation and Impact