Büyük Dil Modelleri için TR-MMLU Benchmarkı: Performans Değerlendirmesi, Zorluklar ve İyileştirme Fırsatları (TR-MMLU Benchmark for Large Language Models: Performance Evaluation, Challenges, and Opportunities for Improvement)

M. Ali Bayram; Ali Arda Fincan; Ahmet Semih Gümüş; Banu Diri; Savaş Yıldırım; Öner Aytaş

arXiv:2508.13044·cs.CL·August 19, 2025

TL;DR

This paper introduces the TR-MMLU benchmark, a comprehensive evaluation framework with 6,200 questions across 62 sections, to assess large language models' capabilities in Turkish, addressing evaluation challenges for resource-limited languages.

Contribution

The paper presents the TR-MMLU benchmark, the first extensive Turkish language model evaluation dataset, enabling detailed performance analysis and highlighting areas for model improvement.

Findings

01. State-of-the-art LLMs show promising performance but need improvements in Turkish language understanding.
02. TR-MMLU provides a new standard for Turkish NLP evaluation.
03. Benchmark results reveal specific linguistic and conceptual challenges for LLMs in Turkish.

Abstract

Language models have made significant advancements in understanding and generating human language, achieving remarkable success in various applications. However, evaluating these models remains a challenge, particularly for resource-limited languages like Turkish. To address this issue, we introduce the Turkish MMLU (TR-MMLU) benchmark, a comprehensive evaluation framework designed to assess the linguistic and conceptual capabilities of large language models (LLMs) in Turkish. TR-MMLU is based on a meticulously curated dataset comprising 6,200 multiple-choice questions across 62 sections within the Turkish education system. This benchmark provides a standard framework for Turkish NLP research, enabling detailed analyses of LLMs' capabilities in processing Turkish text. In this study, we evaluated state-of-the-art LLMs on TR-MMLU, highlighting areas for improvement in model design.…
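
As a rough illustration of how a multiple-choice benchmark organized into sections can be scored, the sketch below computes per-section and overall accuracy by exact match on the chosen option. It is not the authors' evaluation harness: the field names `section` and `answer`, the function `per_section_accuracy`, and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def per_section_accuracy(items, predictions):
    """Return ({section: accuracy}, overall_accuracy).

    items: list of dicts with hypothetical keys "section" and "answer".
    predictions: list of predicted option labels, aligned with items.
    """
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")

    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item["section"]] += 1
        # Exact match on the option label is an assumed scoring rule.
        if pred == item["answer"]:
            correct[item["section"]] += 1

    by_section = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values()) if total else 0.0
    return by_section, overall


# Toy data, invented for this example only.
items = [
    {"section": "history", "answer": "A"},
    {"section": "history", "answer": "C"},
    {"section": "biology", "answer": "B"},
    {"section": "biology", "answer": "D"},
]
predictions = ["A", "B", "B", "D"]

by_section, overall = per_section_accuracy(items, predictions)
print(by_section, overall)
```

Reporting accuracy per section alongside the overall figure is one way to surface the subject areas where a model is weakest, which is the kind of detailed analysis the abstract describes.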
