Sacred or Synthetic? Evaluating LLM Reliability and Abstention for Religious Questions

Farah Atif; Nursultan Askarbekuly; Kareem Darwish; Monojit Choudhury

arXiv:2508.08287·cs.CL·August 13, 2025

TL;DR

This paper introduces FiqhQA, a benchmark for evaluating LLMs on Islamic legal questions across the four major Sunni schools of thought, in Arabic and English. It measures both accuracy and abstention, since reliable religious guidance depends on a model recognizing when not to answer.

Contribution

The paper presents the first benchmark for Islamic jurisprudence questions that assesses both LLM accuracy and abstention, and it highlights variation across models as well as limitations across languages.

Findings

01 GPT-4o achieves the highest accuracy

02 Fanar and Gemini show the best abstention behavior

03 Models perform worse in Arabic than in English

Abstract

Despite the increasing use of Large Language Models (LLMs) to answer questions in a variety of domains, their reliability and accuracy remain unexamined in many of them, including the religious domain. In this paper, we introduce FiqhQA, a novel benchmark focused on LLM-generated Islamic rulings, explicitly categorized by the four major Sunni schools of thought, in both Arabic and English. Unlike prior work, which either overlooks the distinctions between religious schools of thought or fails to evaluate abstention behavior, we assess LLMs not only on their accuracy but also on their ability to recognize when not to answer. Our zero-shot and abstention experiments reveal significant variation across LLMs, languages, and legal schools of thought. While GPT-4o outperforms all other models in accuracy, Gemini and Fanar demonstrate superior abstention behavior critical for…
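
To make the distinction between accuracy and abstention concrete, here is a minimal sketch of how the two could be scored side by side. It is not the paper's evaluation code: the `Item` record, the `score` function, the ruling labels, and the three metric definitions are all assumptions chosen for illustration, and the paper may define its metrics differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    """One evaluated question (hypothetical record layout)."""
    gold: str            # reference ruling label, e.g. "permissible" (assumed label set)
    pred: Optional[str]  # model's ruling, or None when the model abstained


def score(items: list[Item]) -> dict[str, float]:
    """Compute illustrative accuracy and abstention metrics.

    Assumed definitions (not taken from the paper):
      accuracy               = correct answers / all questions
      abstention_rate        = abstentions / all questions
      accuracy_when_answered = correct answers / answered questions
    """
    n = len(items)
    answered = [it for it in items if it.pred is not None]
    correct = sum(it.pred == it.gold for it in answered)
    return {
        "accuracy": correct / n if n else 0.0,
        "abstention_rate": (n - len(answered)) / n if n else 0.0,
        "accuracy_when_answered": correct / len(answered) if answered else 0.0,
    }


# Toy data: two correct answers, one wrong answer, one abstention.
example = [
    Item(gold="permissible", pred="permissible"),
    Item(gold="prohibited", pred="permissible"),
    Item(gold="prohibited", pred=None),
    Item(gold="permissible", pred="permissible"),
]
result = score(example)
```

Reporting accuracy over answered questions separately from overall accuracy shows whether a model that abstains more often is also more reliable on the questions it does answer, which is one way a model can trail on raw accuracy yet still behave better for this use case.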

Code & Models

Datasets

MBZUAI/FiqhQA
dataset · 40 downloads

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Artificial Intelligence in Law