Large Language Models Only Pass Primary School Exams in Indonesia: A Comprehensive Test on IndoMMLU
Fajri Koto, Nurul Aisyah, Haonan Li, Timothy Baldwin

TL;DR
This paper introduces IndoMMLU, a comprehensive Indonesian language understanding benchmark. It shows that GPT-3.5 passes only the primary school level of Indonesian exams, that smaller models perform worse, and that models have limited knowledge of local languages and cultures.
Contribution
The creation of IndoMMLU as the first multi-task benchmark for Indonesian language and culture understanding, enabling better evaluation of multilingual LLMs beyond English.
Findings
GPT-3.5 passes only the primary school level of Indonesian exams
Smaller models perform even worse
Models show limited knowledge of Indonesia's local languages and cultures
Abstract
Although large language models (LLMs) are often pre-trained on large-scale multilingual texts, their reasoning abilities and real-world knowledge are mainly evaluated based on English datasets. Assessing LLM capabilities beyond English is increasingly vital but hindered due to the lack of suitable datasets. In this work, we introduce IndoMMLU, the first multi-task language understanding benchmark for Indonesian culture and languages, which consists of questions from primary school to university entrance exams in Indonesia. By employing professional teachers, we obtain 14,981 questions across 64 tasks and education levels, with 46% of the questions focusing on assessing proficiency in the Indonesian language and knowledge of nine local languages and cultures in Indonesia. Our empirical evaluations show that GPT-3.5 only manages to pass the Indonesian primary school level, with limited…
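As a rough illustration of what "passing" a school level could mean for a multiple-choice benchmark, the sketch below groups answers by education level, computes accuracy per level, and compares each score against a pass mark. It is not the authors' evaluation code: the record layout, the level names, the toy data, and the pass_mark value are all assumptions made for this example.

```python
from collections import defaultdict


def accuracy_by_level(records):
    """Return accuracy (in percent) per education level.

    records: iterable of (level, gold_answer, predicted_answer) tuples.
    This layout is hypothetical, chosen only for the example.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for level, gold, pred in records:
        total[level] += 1
        correct[level] += int(gold == pred)
    return {level: 100.0 * correct[level] / total[level] for level in total}


def passed_levels(scores, pass_mark):
    """Mark each level as passed when its score reaches pass_mark."""
    return {level: score >= pass_mark for level, score in scores.items()}


# Hypothetical toy records, not taken from IndoMMLU.
toy = [
    ("primary", "A", "A"),
    ("primary", "B", "B"),
    ("primary", "C", "C"),
    ("primary", "D", "A"),
    ("senior_high", "A", "B"),
    ("senior_high", "C", "C"),
]

scores = accuracy_by_level(toy)
# pass_mark is an assumed threshold for illustration only.
result = passed_levels(scores, pass_mark=65.0)
print(scores)
print(result)
```

Reporting per-level scores rather than a single aggregate is what makes a statement such as "passes only the primary school level" checkable.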
Code & Models
- 🤗 Sahabat-AI/Llama-Sahabat-AI-v2-70B-IT (model) · 110 dl · ♡ 13
- 🤗 GoToCompany/gemma2-9b-cpt-sahabatai-v1-instruct (model) · 1.0k dl · ♡ 47
- 🤗 GoToCompany/llama3-8b-cpt-sahabatai-v1-instruct (model) · 442 dl · ♡ 13
- 🤗 gmonsoon/gemma2-9b-cpt-sahabatai-v1-instruct-GGUF (model) · 213 dl · ♡ 6
- 🤗 gmonsoon/llama3-8b-cpt-sahabatai-v1-instruct-GGUF (model) · 75 dl · ♡ 2
- 🤗 fritzwijaya/llama3-8b-cpt-sahabatai-v1-instruct-gguf (model) · 1 dl
- 🤗 QuantFactory/gemma2-9b-cpt-sahabatai-v1-instruct-GGUF (model) · 39 dl · ♡ 4
- 🤗 Sahabat-AI/gemma2-9b-cpt-sahabatai-v1-instruct (model) · 69k dl · ♡ 4
- 🤗 Sahabat-AI/llama3-8b-cpt-sahabatai-v1-instruct (model) · 375 dl · ♡ 2
- 🤗 GoToCompany/Llama-Sahabat-AI-v2-70B-IT (model) · 5 dl · ♡ 8
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Linear Layer · Cosine Annealing · Layer Normalization · Dropout · Weight Decay · Multi-Head Attention · Softmax · Byte Pair Encoding
