MizanQA: Benchmarking Large Language Models on Moroccan Legal Question Answering

Adil Bahaj; Mounir Ghogho

arXiv:2508.16357·cs.CL·August 25, 2025

TL;DR

MizanQA is a new benchmark dataset for evaluating large language models on Moroccan legal questions. It highlights the challenges of a low-resource domain with complex linguistic and legal context, and it reveals significant performance gaps.

Contribution

The paper introduces MizanQA, a comprehensive Moroccan legal question answering benchmark dataset for LLM evaluation, emphasizing the need for culturally and domain-specific model development.

Findings

01. Multilingual and Arabic LLMs perform poorly on MizanQA.

02. The dataset captures complex legal reasoning in Arabic and French.

03. Results show significant gaps in current LLM capabilities for legal NLP.

Abstract

The rapid advancement of large language models (LLMs) has significantly propelled progress in natural language processing (NLP). However, their effectiveness in specialized, low-resource domains, such as Arabic legal contexts, remains limited. This paper introduces MizanQA (pronounced Mizan, meaning "scale" in Arabic, a universal symbol of justice), a benchmark designed to evaluate LLMs on Moroccan legal question answering (QA) tasks, characterised by rich linguistic and legal complexity. The dataset draws on Modern Standard Arabic, Islamic Maliki jurisprudence, Moroccan customary law, and French legal influences. Comprising over 1,700 multiple-choice questions, including multi-answer formats, MizanQA captures the nuances of authentic legal reasoning. Benchmarking experiments with multilingual and Arabic-focused LLMs reveal substantial performance gaps, highlighting the need for tailored…
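
Since the abstract mentions multiple-choice questions that can have several correct answers, the following is a minimal sketch of how such items might be scored. It is an illustration only: the abstract excerpt does not state which metrics the paper uses, so the two measures here (strict exact match and Jaccard overlap) and the function names `exact_match`, `jaccard`, and `score` are assumptions, as are the option labels in the usage example.

```python
from typing import Dict, Iterable, Sequence


def exact_match(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Return 1.0 only when the predicted option set equals the gold set."""
    return 1.0 if set(predicted) == set(gold) else 0.0


def jaccard(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Partial credit: size of the intersection over size of the union."""
    p, g = set(predicted), set(gold)
    if not p and not g:
        return 1.0  # both empty: treat as perfect agreement
    return len(p & g) / len(p | g)


def score(
    predictions: Sequence[Iterable[str]],
    references: Sequence[Iterable[str]],
) -> Dict[str, float]:
    """Average both (assumed) metrics over a list of questions."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one question is required")
    n = len(references)
    pairs = list(zip(predictions, references))
    return {
        "exact_match": sum(exact_match(p, g) for p, g in pairs) / n,
        "jaccard": sum(jaccard(p, g) for p, g in pairs) / n,
    }


# Hypothetical model outputs and answer keys for three questions.
predictions = [["A"], ["A", "C"], ["B"]]
references = [["A"], ["A", "B"], ["C", "D"]]
result = score(predictions, references)
```

Exact match penalises any missing or extra option, while Jaccard gives partial credit on multi-answer items, so reporting both shows how much of an error comes from near misses.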

Code & Models

Datasets

MuhammadHelmy/tiny-aya-base-blind-spots-test
dataset · 37 dl
