Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages

Edward Bayes; Israel Abebe Azime; Jesujoba O. Alabi; Jonas Kgomo; Tyna Eloundou; Elizabeth Proehl; Kai Chen; Imaan Khadir; Naome A. Etori; Shamsuddeen Hassan Muhammad; Choice Mpanza; Igneciah Pocia Thete; Dietrich Klakow; David Ifeoluwa Adelani

arXiv:2412.00948·cs.CL·December 3, 2024

Open Access · 2 Datasets

TL;DR

This paper introduces Uhura, a benchmark for evaluating the performance and truthfulness of large language models in six low-resource African languages across scientific and safety-related tasks, highlighting significant performance gaps and the need for improved multilingual NLP in these languages.

Contribution

The paper presents Uhura, a new benchmark for low-resource African languages created through human translation of existing English benchmarks. It addresses the scarcity of datasets for these languages and evaluates models on scientific question answering and a truthfulness-focused safety task.

Findings

01  Models perform significantly worse in African languages than in English.

02  Proprietary models outperform open-source models on the benchmark.

03  All models make more false claims in low-resource languages.

Abstract

Evaluations of Large Language Models (LLMs) on knowledge-intensive tasks and factual accuracy often focus on high-resource languages primarily because datasets for low-resource languages (LRLs) are scarce. In this paper, we present Uhura -- a new benchmark that focuses on two tasks in six typologically diverse African languages, created via human translation of existing English benchmarks. The first dataset, Uhura-ARC-Easy, is composed of multiple-choice science questions. The second, Uhura-TruthfulQA, is a safety benchmark testing the truthfulness of models on topics including health, law, finance, and politics. We highlight the challenges of creating benchmarks with highly technical content for LRLs and outline mitigation strategies. Our evaluation reveals a significant performance gap between proprietary models such as GPT-4o and o1-preview, and Claude models, and open-source models…
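As a rough illustration of how a multiple-choice benchmark such as Uhura-ARC-Easy might be scored per language, the sketch below computes accuracy for each language and its gap to an English reference. It is not the authors' evaluation code: the function names, the `(language, predicted, gold)` record layout, the placeholder language code `lang_x`, and the toy records are all assumptions made up for this example.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return {language: accuracy} for (language, predicted, gold) records.

    The record layout is a hypothetical convention for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, pred, gold in records:
        total[lang] += 1
        if pred == gold:
            correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


def gap_to_reference(acc, reference="en"):
    """Return reference accuracy minus each other language's accuracy."""
    ref = acc[reference]
    return {lang: ref - a for lang, a in acc.items() if lang != reference}


# Toy placeholder predictions, NOT results from the paper.
toy_records = [
    ("en", "A", "A"), ("en", "B", "B"), ("en", "C", "C"), ("en", "D", "A"),
    ("lang_x", "A", "A"), ("lang_x", "B", "C"),
    ("lang_x", "C", "C"), ("lang_x", "D", "A"),
]

acc = accuracy_by_language(toy_records)
gaps = gap_to_reference(acc)
print(acc)
print(gaps)
```

A positive gap means the model answers fewer questions correctly in that language than in English on the same (translated) items, which is the kind of comparison the abstract describes.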

Taxonomy

Topics: Education and Critical Thinking Development · Educational Strategies and Epistemologies

Methods: LLaMA · Focus