P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs
Yidan Zhang, Yu Wan, Boyi Deng, Baosong Yang, Haoran Wei, Fei Huang, Bowen Yu, Junyang Lin, Fei Huang, Jingren Zhou

TL;DR
P-MMEval is a multilingual multitask benchmark for evaluating large language models across diverse tasks and languages. Its datasets are parallel and share consistent language coverage, which supports detailed performance analysis and the study of cross-lingual knowledge transfer.
Contribution
This paper introduces P-MMEval, a large-scale, multilingual, multitask benchmark with parallel datasets, enabling consistent evaluation and comparison of LLMs across tasks and languages.
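As a rough illustration of why parallel samples matter, the sketch below scores the same sample IDs in several languages and measures how often two languages agree on the outcome. It is a minimal sketch under stated assumptions, not the paper's evaluation code or metric: the function names, the `(sample_id, language, is_correct)` record layout, and the toy records are all hypothetical.

```python
from collections import defaultdict


def per_language_accuracy(records):
    """Accuracy per language from (sample_id, language, is_correct) tuples.

    The record layout is an assumption made for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for _sample_id, lang, is_correct in records:
        total[lang] += 1
        correct[lang] += int(is_correct)
    return {lang: correct[lang] / total[lang] for lang in total}


def cross_lingual_consistency(records, lang_a, lang_b):
    """Share of parallel samples where both languages get the same outcome.

    Only sample IDs present in both languages are compared, which is
    what parallel data makes possible. Returns None if nothing is shared.
    """
    outcome = defaultdict(dict)
    for sample_id, lang, is_correct in records:
        outcome[sample_id][lang] = bool(is_correct)
    shared = [o for o in outcome.values() if lang_a in o and lang_b in o]
    if not shared:
        return None
    return sum(o[lang_a] == o[lang_b] for o in shared) / len(shared)


# Toy records for illustration only; these are not results from the paper.
records = [
    ("q1", "en", True), ("q1", "zh", True), ("q1", "ar", False),
    ("q2", "en", True), ("q2", "zh", False), ("q2", "ar", False),
]

accuracy = per_language_accuracy(records)
en_zh = cross_lingual_consistency(records, "en", "zh")
```

Because every language is scored on the same underlying items, a gap between two languages reflects the model rather than a difference in test content.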
Findings
Multilingual models show varied performance across tasks and languages.
Knowledge transfer from English improves performance in other languages.
Model size and prompt design significantly impact results.
Abstract
Recent advancements in large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning. Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks. To alleviate this drawback, we aim to present a comprehensive multilingual multitask benchmark. First, we introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets. Furthermore, P-MMEval delivers consistent language coverage across various datasets and provides parallel samples. Finally, we conduct extensive experiments on representative multilingual model series to compare performances across models and tasks, explore the relationship between multilingual performances and factors such as tasks, model sizes, languages, and prompts, and examine the…
Taxonomy
Topics: Natural Language Processing Techniques
