Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language?
Omid Ghahroodi, Marzia Nouri, Mohammad Vali Sanian, Alireza Sahebi, Doratossadat Dastgheib, Ehsaneddin Asgari, Mahdieh Soleymani Baghshah, and Mohammad Hossein Rohban

TL;DR
The Khayyam Challenge (PersianMMLU) provides a comprehensive, culturally nuanced benchmark with over 20,000 questions to evaluate Persian-supporting LLMs across diverse subjects and educational levels.
Contribution
It introduces a new, extensive Persian language evaluation dataset with rich metadata, avoiding translation issues, and offers a scalable framework for assessing LLMs' language understanding and reasoning.
Findings
Existing LLMs show varied performance across tasks.
The benchmark reveals strengths and weaknesses of Persian LLMs.
Rich metadata enables detailed analysis of model capabilities.
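As an illustration of the metadata-driven analysis mentioned above, the following is a minimal sketch of computing accuracy on four-choice questions grouped by a metadata field. It is not the authors' evaluation code: the record layout, the field names (`task`, `gold`, `pred`), and the toy values are assumptions made for this example only.

```python
from collections import defaultdict


def accuracy_by_field(records, field):
    """Return {field value: accuracy} for records carrying 'gold' and 'pred' keys."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        key = rec[field]
        totals[key] += 1
        if rec["pred"] == rec["gold"]:
            correct[key] += 1
    return {key: correct[key] / totals[key] for key in totals}


# Hypothetical toy records (not taken from the dataset); choices are numbered 1-4.
records = [
    {"task": "mathematics", "gold": 2, "pred": 2},
    {"task": "mathematics", "gold": 1, "pred": 3},
    {"task": "logic", "gold": 4, "pred": 4},
    {"task": "logic", "gold": 1, "pred": 1},
]

per_task = accuracy_by_field(records, "task")
```

Passing a different metadata field name to the same function (for example a hypothetical difficulty or grade-level field) would break the results down along that dimension instead.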
Abstract
Evaluating Large Language Models (LLMs) is challenging due to their generative nature, necessitating precise evaluation methodologies. Additionally, non-English LLM evaluation lags behind English, resulting in the absence or weakness of LLMs for many languages. In response to this necessity, we introduce Khayyam Challenge (also known as PersianMMLU), a meticulously curated collection comprising 20,192 four-choice questions sourced from 38 diverse tasks extracted from Persian examinations, spanning a wide spectrum of subjects, complexities, and ages. The primary objective of the Khayyam Challenge is to facilitate the rigorous evaluation of LLMs that support the Persian language. Distinctive features of the Khayyam Challenge are (i) its comprehensive coverage of various topics, including literary comprehension, mathematics, sciences, logic, intelligence testing, etc., aimed at assessing…
Taxonomy
Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques
