MultiHal: Multilingual Dataset for Knowledge-Graph Grounded Evaluation of LLM Hallucinations
Ernests Lavrinovics, Russa Biswas, Katja Hose, Johannes Bjerva

TL;DR
MultiHal introduces a multilingual, knowledge graph-based benchmark for evaluating and mitigating hallucinations in large language models, leveraging structured factual data across multiple languages and improving factuality assessment.
Contribution
It creates a novel multilingual, multihop knowledge graph benchmark for LLM hallucination evaluation, filling gaps in existing datasets by integrating structured factual resources.
Findings
KG-RAG improved semantic similarity scores by 0.12 to 0.36 points over vanilla QA
NLI entailment improved by 0.16 to 0.36 points
Hallucination detection performance improved by 0.29 to 0.42 points
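To make the notion of an absolute gain in semantic similarity concrete, here is a minimal sketch. It assumes a bag-of-words cosine similarity as a crude stand-in for a sentence-embedding model; the example answers and the names `bow_vector`, `cosine_similarity`, and `semantic_score` are hypothetical and not taken from the paper.

```python
import math
import re
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Lowercased word counts; an assumed stand-in for a sentence embedding."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_score(prediction: str, gold: str) -> float:
    """Similarity between a model answer and the gold answer."""
    return cosine_similarity(bow_vector(prediction), bow_vector(gold))


# Hypothetical answers to the same question, without and with KG context.
gold = "Paris"
score_without_kg = semantic_score("The capital is Lyon", gold)
score_with_kg = semantic_score("The capital is Paris", gold)

# An "absolute" improvement is a plain difference of the two scores.
absolute_gain = score_with_kg - score_without_kg
print(score_without_kg, score_with_kg, absolute_gain)
```

A real evaluation would replace `bow_vector` with an embedding model and average the scores over the whole benchmark before taking the difference.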
Abstract
Large Language Models (LLMs) have inherent limitations of faithfulness and factuality, commonly referred to as hallucinations. Several benchmarks have been developed that provide a test bed for factuality evaluation within the context of English-centric datasets, while relying on supplementary informative context like web links or text passages but ignoring the available structured factual resources. To this end, Knowledge Graphs (KGs) have been identified as a useful aid for hallucination mitigation, as they provide a structured way to represent the facts about entities and their relations with minimal linguistic overhead. We bridge the lack of KG paths and multilinguality for factual language modeling within the existing hallucination evaluation benchmarks and propose a KG-based multilingual, multihop benchmark called MultiHal framed for generative text evaluation. As part of our data…
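As an illustration of what a multihop KG path between a question entity and an answer entity looks like, the sketch below searches a toy in-memory triple list for paths of at most two hops. The triples, the relation names, and the helper `mine_paths` are hypothetical; this is not the authors' mining pipeline, only a minimal example of the idea.

```python
from collections import defaultdict

Triple = tuple[str, str, str]  # (subject, predicate, object)


def build_index(triples: list[Triple]) -> dict[str, list[tuple[str, str]]]:
    """Map each subject to its outgoing (predicate, object) edges."""
    index: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for s, p, o in triples:
        index[s].append((p, o))
    return index


def mine_paths(
    triples: list[Triple], subject: str, answer: str, max_hops: int = 2
) -> list[list[Triple]]:
    """Breadth-first search for directed paths from subject to answer."""
    index = build_index(triples)
    found: list[list[Triple]] = []
    frontier: list[tuple[str, list[Triple]]] = [(subject, [])]
    for _ in range(max_hops):
        next_frontier = []
        for node, path in frontier:
            for p, o in index.get(node, []):
                new_path = path + [(node, p, o)]
                if o == answer:
                    found.append(new_path)  # path ends at the answer entity
                else:
                    next_frontier.append((o, new_path))
        frontier = next_frontier
    return found


# Toy knowledge graph (hypothetical triples for illustration only).
toy_kg = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
    ("Marie Curie", "field", "Physics"),
]

paths = mine_paths(toy_kg, "Marie Curie", "Poland")
for path in paths:
    print(" -> ".join(f"{s} [{p}] {o}" for s, p, o in path))
```

With `max_hops=1` the same query returns no path, which shows how a hop limit can hide facts that are only reachable through longer chains.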
Peer Reviews
Decision: Submitted to ICLR 2026
+ The approach and use of KG path mining make sense.
+ Quality evaluations and baseline analysis have been performed well.
+ Good to see several datasets used.
The paper has some unclear assumptions and metrics, such as the quality score, and the effect of multilinguality (or its importance) has not been carefully assessed. Small models have been used for analysis, limiting the impact analysis and generalizability. GNN methods are not discussed or compared against.
The multilingual design fills a gap in cross-lingual factuality evaluation, where low-resource languages often suffer from higher hallucination rates. The evaluation is comprehensive, using three models (Gemini 2.0 Flash, GPT-4o Mini, Llama 3.3 70B) and three metrics (semantic similarity, NLI, hallucination detection with HHEM-2.1), which supports the robustness of the results.
KG path mining is restricted to 2 hops, potentially missing complex relational knowledge required for reasoning-intensive questions. Lack of fine-grained hallucination localization limits the utility for debugging LLM behavior. Reliance on closed-source GPT-4o Mini for path quality evaluation raises concerns about reproducibility. No multi-prompt evaluation, despite the known sensitivity of LLM performance to prompt formatting. Semantic similarity metrics underestimate performance, as model respo
1. The resource aims to bridge an important gap: the lack of good multilingual data sources for evaluating LLM hallucination. 2. The resource has been analysed with many evaluation results, including baseline experimentation.
1. An LLM is used to assess the quality of the extracted knowledge graph paths, by analysing the correlation with semantic scores between the predicted and gold answers for each question. Sometimes, when a path is incomplete or only partially correct, the LLM can still give the ground truth answer. I'm afraid this method cannot ensure a fair evaluation of the quality of the paths themselves. 2. The knowledge graphs are incomplete. The potential impacts of this factor are not considered in const
Taxonomy
Topics: Biomedical Text Mining and Ontologies
