Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content
Abdullah Mushtaq, Rafay Naeem, Ezieddin Elmahjub, Ibrahim Ghaznavi, Shawqi Al-Maliki, Mohamed Abdallah, Ala Al-Fuqaha, Junaid Qadir

TL;DR
This paper evaluates how faithfully large language models generate Islamic content, highlights their strengths and shortcomings in accuracy and citations, and emphasizes the need for community-driven benchmarks in high-stakes domains.
Contribution
It introduces a dual-agent evaluation framework for assessing LLMs on Islamic content, providing a comparative analysis of GPT-4o, Ansari AI, and Fanar.
Findings
GPT-4o scored highest in Islamic Accuracy (3.93) and Citation (3.38), ahead of Ansari AI (3.68, 3.32) and Fanar (2.76, 1.82).
Ansari AI led in qualitative pairwise wins.
Models still struggle with reliable accuracy and citations.
Abstract
Large language models are increasingly used for Islamic guidance, but risk misquoting texts, misapplying jurisprudence, or producing culturally inconsistent responses. We pilot an evaluation of GPT-4o, Ansari AI, and Fanar on prompts from authentic Islamic blogs. Our dual-agent framework uses a quantitative agent for citation verification and six-dimensional scoring (e.g., Structure, Islamic Consistency, Citations) and a qualitative agent for five-dimensional side-by-side comparison (e.g., Tone, Depth, Originality). GPT-4o scored highest in Islamic Accuracy (3.93) and Citation (3.38), Ansari AI followed (3.68, 3.32), and Fanar lagged (2.76, 1.82). Despite relatively strong performance, models still fall short in reliably producing accurate Islamic content and citations -- a paramount requirement in faith-sensitive writing. GPT-4o had the highest mean quantitative score (3.90/5), while…
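As a rough illustration of how the reported per-dimension scores compare, the sketch below ranks the three models on the two quantitative dimensions whose means appear in the abstract. The function name `rank_by_dimension` and the dictionary layout are assumptions made for this example, not part of the paper's framework; the other four quantitative dimensions are omitted because their values are not given here.

```python
# Per-dimension mean scores as stated in the abstract. Only two of the six
# quantitative dimensions are reported there, so only those are included.
REPORTED_SCORES = {
    "GPT-4o": {"Islamic Accuracy": 3.93, "Citation": 3.38},
    "Ansari AI": {"Islamic Accuracy": 3.68, "Citation": 3.32},
    "Fanar": {"Islamic Accuracy": 2.76, "Citation": 1.82},
}


def rank_by_dimension(scores):
    """Map each dimension to the model names sorted from highest to lowest score.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    dimensions = sorted({dim for per_model in scores.values() for dim in per_model})
    return {
        dim: sorted(scores, key=lambda model: scores[model][dim], reverse=True)
        for dim in dimensions
    }


rankings = rank_by_dimension(REPORTED_SCORES)
for dimension, ordered_models in rankings.items():
    print(dimension, "->", ", ".join(ordered_models))
```

On these two dimensions the ordering is the same (GPT-4o, then Ansari AI, then Fanar), which matches the abstract's description; it says nothing about the qualitative pairwise comparison, which is scored separately.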
