HalluScore: Large Language Model Hallucination Question Answering Benchmark
Aisha Alansari, Hamzah Luqman

TL;DR
HalluScore is a comprehensive Arabic question answering benchmark designed to evaluate, analyze, and mitigate hallucinations in large language models across various reasoning levels, knowledge domains, and cultural contexts.
Contribution
The paper introduces a structured dataset of 827 questions with ground-truth evidence and annotations for assessing hallucination in Arabic LLMs, addressing a significant resource gap.
Findings
Hallucination patterns vary across different LLMs and are influenced by cultural and linguistic factors.
Arabic LLMs face unique challenges beyond factual inaccuracies, including cultural understanding and logical reasoning.
The benchmark enables detailed analysis and comparison of hallucination behaviors in multilingual and reasoning-capable LLMs.
Abstract
Large language models (LLMs) have achieved remarkable progress in natural language generation, but remain susceptible to hallucination. In response to growing concerns about hallucinations, several benchmarks have been developed, primarily in English and Chinese. However, Arabic remains underrepresented, with few benchmarks for LLM hallucination due to scarce annotated resources and the language's morphological complexity. Consequently, existing benchmarks do not adequately reflect the linguistic, cultural, and reasoning characteristics of Arabic. To address this gap, we introduce HalluScore, a structured Arabic question answering benchmark designed to evaluate hallucination behavior in LLMs across different levels of reasoning difficulty, various knowledge domains, historical timelines, and culturally grounded Arabic scenarios. It contains 827 carefully curated questions for…
