MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks

Artem Chervyakov; Alexander Kharitonov; Pavel Zadorozhny; Adamenko Pavel; Rodion Levichev; Dmitrii Vorobev; Dmitrii Salikhov; Aidar Valeev; Alena Pestova; Maria Dziuba; Ilseyar Alimova; Artem Zavgorodnev; Aleksandr Medvedev; Stanislav Moiseev; Elena Bruches; Daniil Grebenkin; Roman Derunets; Vikulov Vladimir; Anton Emelyanov; Dmitrii Babaev; Vladimir V. Ivanov; Valentin Malykh; Alena Fenogenova

arXiv:2507.12284 · cs.SE · December 2, 2025

Open Access

TL;DR

MERA Code introduces a comprehensive benchmark for evaluating the practical coding skills of language models across multiple programming languages and tasks, addressing gaps in current evaluation methods.

Contribution

It presents a new multilingual, multi-task benchmark with an open-source framework, scoring system, and platform for standardized code generation evaluation.

Findings

01 Open LLMs show limitations in practical coding tasks.

02 The benchmark reveals language- and task-specific performance gaps.

03 MERA Code facilitates standardized assessment and comparison.

Abstract

Advancements in LLMs have enhanced task automation in software engineering; however, current evaluations primarily focus on natural language tasks, overlooking code quality. Most benchmarks prioritize high-level reasoning over executable code and real-world performance, leaving gaps in understanding true capabilities and risks associated with these models in production. To address this issue, we propose MERA Code, a new addition to the MERA benchmark family, specifically focused on evaluating code for the latest code generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. Our proposed evaluation methodology features a taxonomy that outlines the practical coding skills necessary for models to complete these tasks. The benchmark comprises an open-source codebase for users to conduct MERA assessments, a scoring system compatible with…

Taxonomy

Topics: Model-Driven Software Engineering Techniques · Software Engineering Research · Teaching and Learning Programming