TL;DR
Cetvel is a comprehensive Turkish benchmark that evaluates large language models across diverse tasks, with an emphasis on linguistic and cultural content. It reveals that Turkish-specific models lag behind multilingual ones.
Contribution
Introduces Cetvel, a novel Turkish benchmark with diverse, culturally relevant tasks, filling gaps in existing Turkish LLM evaluation frameworks.
Findings
Turkish-centric instruction-tuned models generally underperform multilingual or general-purpose models such as Llama 3 and Mistral.
Grammatical error correction and extractive QA are highly discriminative tasks, meaning they separate model capabilities most clearly.
Abstract
We introduce Cetvel, a comprehensive benchmark designed to evaluate large language models (LLMs) in Turkish. Existing Turkish benchmarks often lack task diversity, culturally relevant content, or both. Cetvel addresses these gaps by combining a broad range of discriminative and generative tasks with content that reflects the linguistic and cultural richness of the Turkish language. Cetvel covers 23 tasks grouped into seven categories, including grammatical error correction, machine translation, and question answering rooted in Turkish history and idiomatic language. We evaluate 33 open-weight LLMs (up to 70B parameters) covering different model families and instruction paradigms. Our experiments reveal that Turkish-centric instruction-tuned models generally underperform relative to multilingual or general-purpose models (e.g., Llama 3 and Mistral), despite…