Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX

Nikita Martynov; Anastasia Mordasheva; Dmitriy Gorbetskiy; Danil Astafurov; Ulyana Isaeva; Elina Basyrova; Sergey Skachkov; Victoria Berestova; Nikolay Ivanov; Valeriia Zanina; Alena Fenogenova

arXiv:2505.24616·cs.CL·December 2, 2025

Open Access · 4 models · 1 dataset

TL;DR

This paper introduces POLLUX, an open-source benchmark with a novel interpretability-focused evaluation methodology for Russian LLMs, including a detailed taxonomy, scoring protocol, and LLM-based evaluators.

Contribution

The paper presents a new evaluation framework for Russian LLMs that emphasizes interpretability and transparency, along with a comprehensive benchmark dataset and LLM-based evaluators.

Findings

01. POLLUX enables transparent, criteria-driven evaluation of LLMs.

02. The benchmark covers 35 diverse task types with 2,100 prompts.

03. LLM-based evaluators provide nuanced assessments comparable to human judgments.

Abstract

We introduce POLLUX, a comprehensive open-source benchmark designed to evaluate the generative capabilities of large language models (LLMs) in Russian. Our main contribution is a novel evaluation methodology that enhances the interpretability of LLM assessment. For each task type, we define a set of detailed criteria and develop a scoring protocol where models evaluate responses and provide justifications for their ratings. This enables transparent, criteria-driven evaluation beyond traditional resource-consuming, side-by-side human comparisons. POLLUX includes a detailed, fine-grained taxonomy of 35 task types covering diverse generative domains such as code generation, creative writing, and practical assistant use cases, totaling 2,100 manually crafted and professionally authored prompts. Each task is categorized by difficulty (easy/medium/hard), with experts constructing the dataset…
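The criteria-driven protocol described in the abstract (a score plus a written justification per criterion, for each task type and difficulty level) can be pictured with a small data structure. The following is only a minimal sketch under stated assumptions: the class name `CriterionJudgement`, the function `mean_score_by_criterion`, the criterion names, and the integer score scale are all hypothetical and are not taken from the paper or from the POLLUX codebase.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class CriterionJudgement:
    """One evaluator verdict: a score plus its written justification."""
    task_type: str       # one of the benchmark's task types, e.g. "code generation"
    difficulty: str      # "easy", "medium" or "hard"
    criterion: str       # hypothetical criterion name
    score: int           # assumed integer scale; the real scale may differ
    justification: str   # free-text rationale produced by the evaluator


def mean_score_by_criterion(judgements):
    """Average the scores for every (task_type, criterion) pair."""
    buckets = defaultdict(list)
    for j in judgements:
        buckets[(j.task_type, j.criterion)].append(j.score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Illustrative, made-up verdicts (not data from the benchmark).
judgements = [
    CriterionJudgement("code generation", "easy", "correctness", 2,
                       "Runs and handles the stated case."),
    CriterionJudgement("code generation", "hard", "correctness", 0,
                       "Fails on empty input."),
    CriterionJudgement("creative writing", "medium", "style", 1,
                       "Tone drifts in the second half."),
]

report = mean_score_by_criterion(judgements)
```

Keeping the justification next to each score is what lets a reader trace an aggregate number back to the individual verdicts, which is the interpretability property the abstract emphasizes.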

Code & Models

Models

Datasets

ai-forever/POLLUX (dataset, 305 downloads)

Taxonomy

Topics: Artificial Intelligence in Law · Legal Education and Practice Innovations

Methods: Sparse Evolutionary Training