MERA: A Comprehensive LLM Evaluation in Russian

Alena Fenogenova; Artem Chervyakov; Nikita Martynov; Anastasia Kozlova; Maria Tikhonova; Albina Akhmetgareeva; Anton Emelyanov; Denis Shevelev; Pavel Lebedev; Leonid Sinev; Ulyana Isaeva; Katerina Kolomeytseva; Daniil Moskovskiy; Elizaveta Goncharova; Nikita Savushkin; Polina Mikhailova; Denis Dimitrov; Alexander Panchenko; Sergei Markov

arXiv:2401.04531·cs.CL·August 5, 2024·2 cites

Open Access · 1 Video

TL;DR

This paper introduces MERA, an open benchmark for evaluating Russian-language foundation models across multiple skills, providing a standardized, comprehensive assessment framework to understand their capabilities and limitations.

Contribution

The paper presents MERA, a new instruction benchmark for assessing Russian-language foundation models in zero- and few-shot settings, together with an evaluation methodology, open-source tools, and a leaderboard.

Findings

01 Open LMs lag behind human performance.

02 MERA covers 21 tasks across 11 skill domains.

03 The benchmark facilitates standardized evaluation of Russian LMs.

Abstract

Over the past few years, one of the most notable advancements in AI research has been in foundation models (FMs), headlined by the rise of language models (LMs). As the models' size increases, LMs demonstrate enhancements in measurable aspects and the development of new qualitative features. However, despite researchers' attention and the rapid growth in LM application, the capabilities, limitations, and associated risks still need to be better understood. To address these issues, we introduce an open Multimodal Evaluation of Russian-language Architectures (MERA), a new instruction benchmark for evaluating foundation models oriented towards the Russian language. The benchmark encompasses 21 evaluation tasks for generative models in 11 skill domains and is designed as a black-box test to ensure the exclusion of data leakage. The paper introduces a methodology to evaluate FMs and LMs in…

Videos

MERA: A Comprehensive LLM Evaluation in Russian · underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Balanced Selection