TurkBench: A Benchmark for Evaluating Turkish Large Language Models

Çağrı Toraman; Ahmet Kaan Sever; Ayse Aysu Cengiz; Elif Ecem Arslan; Görkem Sevinç; Mete Mert Birdal; Yusuf Faruk Güldemir; Ali Buğra Kanburoğlu; Sezen Felekoğlu; Osman Gürlek; Sarp Kantar; Birsen Şahin Kütük; Büşra Tufan; Elif Genç; Serkan Coşkun; Gupse Ekin Demir; Muhammed Emin Arayıcı; Olgun Dursun; Onur Gungor; Susan Üsküdarlı; Abdullah Topraksoy; Esra Darıcı

arXiv:2601.07020 · cs.CL · February 4, 2026

TL;DR

TurkBench is a benchmark for evaluating generative large language models in Turkish across knowledge, language understanding, reasoning, and other task categories.

Contribution

This paper introduces TurkBench, the first extensive benchmark for Turkish language models, with 8,151 samples across 21 subtasks, filling a gap in language-specific evaluation tools.

Findings

01. TurkBench covers diverse tasks including knowledge, reasoning, and grammar.

02. It provides a culturally relevant dataset for Turkish language model evaluation.

03. The benchmark is publicly available for online submissions.

Abstract

With the recent surge in the development of large language models, the need for comprehensive and language-specific evaluation benchmarks has become critical. While significant progress has been made in evaluating English-language models, benchmarks for other languages, particularly those with unique linguistic characteristics such as Turkish, remain less developed. Our study introduces TurkBench, a comprehensive benchmark designed to assess the capabilities of generative large language models in the Turkish language. TurkBench involves 8,151 data samples across 21 distinct subtasks. These are organized under six main categories of evaluation: Knowledge, Language Understanding, Reasoning, Content Moderation, Turkish Grammar and Vocabulary, and Instruction Following. The diverse range of tasks and the culturally relevant data would provide researchers and developers with a valuable tool…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification