Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content

Abdullah Mushtaq; Rafay Naeem; Ezieddin Elmahjub; Ibrahim Ghaznavi; Shawqi Al-Maliki; Mohamed Abdallah; Ala Al-Fuqaha; Junaid Qadir

arXiv:2510.24438 · cs.CL · October 29, 2025

TL;DR

This paper evaluates how faithfully large language models generate Islamic content, highlights their strengths and shortcomings in accuracy and citations, and emphasizes the need for community-driven benchmarks in high-stakes domains.

Contribution

The paper introduces a dual-agent evaluation framework for assessing LLMs on Islamic content and provides a comparative analysis of GPT-4o, Ansari AI, and Fanar.

Findings

1. GPT-4o scored highest in Islamic accuracy and citations.

2. Ansari AI led in qualitative pairwise wins.

3. Models still struggle with reliable accuracy and citations.

Abstract

Large language models are increasingly used for Islamic guidance, but risk misquoting texts, misapplying jurisprudence, or producing culturally inconsistent responses. We pilot an evaluation of GPT-4o, Ansari AI, and Fanar on prompts from authentic Islamic blogs. Our dual-agent framework uses a quantitative agent for citation verification and six-dimensional scoring (e.g., Structure, Islamic Consistency, Citations) and a qualitative agent for five-dimensional side-by-side comparison (e.g., Tone, Depth, Originality). GPT-4o scored highest in Islamic Accuracy (3.93) and Citation (3.38), Ansari AI followed (3.68, 3.32), and Fanar lagged (2.76, 1.82). Despite relatively strong performance, models still fall short in reliably producing accurate Islamic content and citations -- a paramount requirement in faith-sensitive writing. GPT-4o had the highest mean quantitative score (3.90/5), while…
