ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding

Israel Abebe Azime; Atnafu Lambebo Tonja; Tadesse Destaw Belay; Yonas Chanie; Bontu Fufa Balcha; Negasi Haile Abadi; Henok Biadglign Ademtew; Mulubrhan Abebe Nerea; Debela Desalegn Yadeta; Derartu Dagne Geremew; Assefa Atsbiha tesfau; Philipp Slusallek; Thamar Solorio; Dietrich Klakow

arXiv:2411.05049·cs.CL·February 11, 2025

TL;DR

ProverbEval introduces a benchmark for evaluating low-resource language understanding in culture-specific contexts and shows that factors such as answer choice order and language cause large variation in measured LLM performance.

Contribution

This work presents ProverbEval, a benchmark designed for low-resource languages that emphasizes culture-specific understanding, and analyzes the factors that influence LLM evaluation outcomes.

Findings

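1. Performance varies by up to 50% depending on the order in which answer choices are presented in multiple-choice tasks.

2. Native-language proverb descriptions improve task performance, for example in proverb generation.

3. Monolingual evaluations outperform cross-lingual ones.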

Abstract

With the rapid development of evaluation datasets to assess LLMs' understanding across a wide range of subjects and domains, identifying a suitable language understanding benchmark has become increasingly challenging. In this work, we explore LLM evaluation challenges for low-resource language understanding and introduce ProverbEval, an LLM evaluation benchmark for low-resource languages that focuses on language understanding in culture-specific scenarios. We benchmark various LLMs and explore factors that create variability in the benchmarking process. We observed performance variances of up to 50%, depending on the order in which answer choices were presented in multiple-choice tasks. Native language proverb descriptions significantly improve tasks such as proverb generation, contributing to improved outcomes. Additionally, monolingual evaluations consistently outperformed…
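The sensitivity to answer choice order described in the abstract can be illustrated with a small sketch. This is not the authors' evaluation protocol; it only assumes a multiple-choice item and a predictor that returns the index of its chosen option. The names `accuracy_by_answer_position`, `Predictor`, and `always_first`, as well as the sample question and options, are hypothetical and introduced purely for illustration.

```python
from itertools import permutations
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: a predictor receives the question and the ordered
# choices, and returns the index of the choice it selects.
Predictor = Callable[[str, Sequence[str]], int]


def accuracy_by_answer_position(
    question: str,
    choices: Sequence[str],
    correct: str,
    predict: Predictor,
) -> Dict[int, float]:
    """Accuracy grouped by the position at which the correct answer appears.

    Every ordering of the (distinct) choices is presented to the predictor.
    """
    hits: Dict[int, List[bool]] = {i: [] for i in range(len(choices))}
    for ordering in permutations(choices):
        picked = predict(question, ordering)
        position = ordering.index(correct)
        hits[position].append(ordering[picked] == correct)
    return {pos: sum(flags) / len(flags) for pos, flags in hits.items()}


def always_first(question: str, choices: Sequence[str]) -> int:
    """Toy stand-in for a model with an extreme position bias (assumption)."""
    return 0


scores = accuracy_by_answer_position(
    "Which description matches the proverb?",
    ["Meaning A", "Meaning B", "Meaning C", "Meaning D"],
    "Meaning C",
    always_first,
)
# Gap between the best and worst answer positions.
spread = max(scores.values()) - min(scores.values())
print(scores, spread)
```

A predictor that ignores content and always picks the first option is correct only when the right answer happens to be listed first, so the gap between positions is maximal. A real model would replace `always_first`, and the same spread would then summarize how much its score depends on option order.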

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling

Methods: Softmax · Attention Is All You Need · Focus