Evaluating Modern Large Language Models on Low-Resource and Morphologically Rich Languages: A Cross-Lingual Benchmark Across Cantonese, Japanese, and Turkish

Chengxuan Xia; Qianye Wu; Hongbin Guan; Sixuan Tian; Yilun Hao; Xiaoyu Wu

arXiv:2511.10664 · cs.CL · February 13, 2026

Open Access

TL;DR

This paper evaluates seven large language models on low-resource and morphologically rich languages—Cantonese, Japanese, and Turkish—using a new cross-lingual benchmark across four tasks, highlighting strengths and gaps in multilingual and cultural understanding.

Contribution

The paper introduces a cross-lingual benchmark for low-resource and morphologically rich languages (Cantonese, Japanese, and Turkish) and uses it to evaluate seven leading LLMs, exposing performance gaps in cultural understanding and morphological generalization.

Findings

01

Largest models outperform smaller ones across tasks and languages

02

Significant gaps remain in cultural nuance and morphological generalization

03

GPT-4o shows strong multilingual and cross-lingual performance

Abstract

Large language models (LLMs) have achieved impressive results in high-resource languages like English, yet their effectiveness in low-resource and morphologically rich languages remains underexplored. In this paper, we present a comprehensive evaluation of seven cutting-edge LLMs, including GPT-4o, GPT-4, Claude 3.5 Sonnet, LLaMA 3.1, Mistral Large 2, LLaMA-2 Chat 13B, and Mistral 7B Instruct, on a new cross-lingual benchmark covering Cantonese, Japanese, and Turkish. Our benchmark spans four diverse tasks: open-domain question answering, document summarization, English-to-X translation, and culturally grounded dialogue. We combine human evaluations (rating fluency, factual accuracy, and cultural appropriateness) with automated metrics (e.g., BLEU, ROUGE) to assess model performance. Our results reveal that while the largest proprietary models (GPT-4o, GPT-4,…
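To make the automated-metric side of the evaluation concrete, the sketch below computes a ROUGE-1-style unigram-overlap F1 score between a model output and a reference. It is an illustration only, not the paper's evaluation code: the function names `rouge1_f1` and `char_tokens` are hypothetical, and the tokenization choices (character-level for unsegmented scripts such as Cantonese and Japanese, whitespace splitting for Turkish) are assumptions, since the abstract does not say how text was tokenized.

```python
from collections import Counter


def rouge1_f1(candidate_tokens, reference_tokens):
    """Unigram-overlap F1 between two token lists (ROUGE-1 style)."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    cand = Counter(candidate_tokens)
    ref = Counter(reference_tokens)
    # Clipped overlap: each token counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def char_tokens(text):
    """Assumption: character-level tokens for scripts without spaces."""
    return [ch for ch in text if not ch.isspace()]


# Turkish example with whitespace tokens (an assumed tokenization).
score_tr = rouge1_f1("kedi evde uyuyor".split(), "kedi bahçede uyuyor".split())

# Japanese example with character tokens (an assumed tokenization).
score_ja = rouge1_f1(char_tokens("今日は晴れ"), char_tokens("今日は雨"))
```

Whitespace tokens treat each inflected Turkish word form as a separate unit, so a correct stem with a different suffix scores as a miss. That is one reason surface-overlap metrics can understate quality in morphologically rich languages, and it is consistent with the paper pairing them with human ratings.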

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods