MELAC: Massive Evaluation of Large Language Models with Alignment of Culture in Persian Language

Farhan Farsi; Farnaz Aghababaloo; Shahriar Shariati Motlagh; Parsa Ghofrani; MohammadAli SadraeiJavaheri; Shayan Bali; Amirhossein Shabani; Farbod Bijary; Ghazal Zamaninejad; AmirMohammad Salehoof; Saeedeh Momtazi

arXiv:2508.00673·cs.CL·August 4, 2025

TL;DR

This paper introduces new Persian language evaluation datasets to assess large language models' performance and cultural understanding, addressing the lack of non-Western cultural benchmarks.

Contribution

The paper presents 19 new evaluation datasets covering the Persian language and Iranian culture, and benchmarks 41 LLMs on them to help fill the evaluation gap for non-Western contexts.

Findings

1. Identified significant performance gaps in LLMs on Persian cultural tasks.
2. Provided comprehensive benchmarks for 41 LLMs in Persian language understanding.
3. Highlighted the need for culturally diverse evaluation resources.

Abstract

As large language models (LLMs) become increasingly embedded in our daily lives, evaluating their quality and reliability across diverse contexts has become essential. While comprehensive benchmarks exist for assessing LLM performance in English, there remains a significant gap in evaluation resources for other languages. Moreover, because most LLMs are trained primarily on data rooted in European and American cultures, they often lack familiarity with non-Western cultural contexts. To address this limitation, our study focuses on the Persian language and Iranian culture. We introduce 19 new evaluation datasets specifically designed to assess LLMs on topics such as Iranian law, Persian grammar, Persian idioms, and university entrance exams. Using these datasets, we benchmarked 41 prominent LLMs, aiming to bridge the existing cultural and linguistic evaluation gap in the field.

Taxonomy

Topics: Computational and Text Analysis Methods · Topic Modeling · Natural Language Processing Techniques