TL;DR
This paper introduces ZNO-Eval, a comprehensive benchmark for assessing the reasoning abilities of large language models in Ukrainian. It is built from real exam tasks across multiple subjects and shows where current models are strong and where they fall short.
Contribution
The paper presents the first Ukrainian reasoning benchmark derived from standardized exams, enabling detailed evaluation of LLMs across diverse subjects and complexities.
Findings
GPT-4o outperforms the other evaluated models in reasoning and language tasks.
Gemini Pro and GPT-4 Turbo excel in arithmetic problems.
Models score near the maximum in history and geography but lag in Ukrainian language and math.
Abstract
As the usage of large language models for problems outside of simple text understanding or generation increases, assessing their abilities and limitations becomes crucial. While significant progress has been made in this area over the last few years, most research has focused on benchmarking English, leaving other languages underexplored. This makes evaluating the reasoning and robustness level of language models in Ukrainian particularly challenging. The purpose of this work is to establish a comprehensive benchmark for the reasoning capabilities evaluation of large language models in the Ukrainian language. This paper presents the ZNO-Eval benchmark based on real exam tasks from Ukraine's standardized educational testing system: the External Independent Evaluation and the National Multi-subject Test. With single-answer options, multiple-choice, matching, and open-ended questions from…
Taxonomy
Methods: Absolute Position Encodings · Cosine Annealing · Adam · Residual Connection · Dropout · Softmax · Byte Pair Encoding · Linear Layer
