MultiHal: Multilingual Dataset for Knowledge-Graph Grounded Evaluation of LLM Hallucinations

Ernests Lavrinovics; Russa Biswas; Katja Hose; Johannes Bjerva

arXiv:2505.14101·cs.CL·October 24, 2025

TL;DR

MultiHal introduces a multilingual, knowledge graph-based benchmark for evaluating and mitigating hallucinations in large language models, leveraging structured factual data across multiple languages and improving factuality assessment.

Contribution

It creates a novel multilingual, multihop knowledge graph benchmark for LLM hallucination evaluation, filling gaps in existing datasets by integrating structured factual resources.
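To make "multihop KG path" concrete, the sketch below enumerates bounded-length paths between a question entity and an answer entity in a toy triple store. It is only an illustration under stated assumptions: the triples, the `find_paths` helper, and the `max_hops` parameter are hypothetical and are not taken from the MultiHal pipeline or dataset.

```python
from collections import deque

# Toy (subject, relation, object) triples; illustrative only, not MultiHal data.
TRIPLES = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
    ("Marie Curie", "field", "Physics"),
    ("Poland", "continent", "Europe"),
]


def find_paths(triples, start, goal, max_hops=2):
    """Return every path from start to goal that uses at most max_hops edges."""
    # Adjacency list: subject -> list of (relation, object).
    adjacency = {}
    for subj, rel, obj in triples:
        adjacency.setdefault(subj, []).append((rel, obj))

    paths = []
    queue = deque([(start, [])])  # (current entity, triples traversed so far)
    while queue:
        entity, path = queue.popleft()
        if entity == goal and path:
            paths.append(path)
            continue
        if len(path) == max_hops:
            continue  # hop budget exhausted on this branch
        for rel, obj in adjacency.get(entity, []):
            queue.append((obj, path + [(entity, rel, obj)]))
    return paths


paths = find_paths(TRIPLES, "Marie Curie", "Poland")
for path in paths:
    print(" -> ".join(f"{s} [{r}] {o}" for s, r, o in path))
```

With the default budget of two hops, the entity "Europe" is unreachable from "Marie Curie" in this toy graph; raising `max_hops` to 3 recovers the longer path, which shows how a hop limit bounds the relational knowledge a mined path can carry.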

Findings

01

Improved semantic similarity scores by 0.12 to 0.36 points

02

Enhanced NLI entailment accuracy by 0.16 to 0.36 points

03

Increased hallucination detection performance by 0.29 to 0.42 points

Abstract

Large Language Models (LLMs) have inherent limitations of faithfulness and factuality, commonly referred to as hallucinations. Several benchmarks have been developed that provide a test bed for factuality evaluation within the context of English-centric datasets, while relying on supplementary informative context like web links or text passages but ignoring the available structured factual resources. To this end, Knowledge Graphs (KGs) have been identified as a useful aid for hallucination mitigation, as they provide a structured way to represent the facts about entities and their relations with minimal linguistic overhead. We bridge the lack of KG paths and multilinguality for factual language modeling within the existing hallucination evaluation benchmarks and propose a KG-based multilingual, multihop benchmark called MultiHal framed for generative text evaluation. As part of our data…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 2

Strengths

+ The approach and the use of KG path mining make sense.
+ Quality evaluations and baseline analysis have been performed well.
+ It is good to see several datasets used.

Weaknesses

The paper has some unclear assumptions and metrics, such as the quality score, and the effect of multilinguality (or its importance) has not been carefully assessed. Small models have been used for the analysis, limiting the impact analysis and generalizability. GNN methods are not discussed or compared against.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

Multilingual design fills a gap in cross-lingual factuality evaluation, where low-resource languages often suffer from higher hallucination rates. Comprehensive evaluation using three models (Gemini 2.0 Flash, GPT-4o Mini, Llama 3.3 70bn) and three metrics (semantic similarity, NLI, hallucination detection with HHEM-2.1), ensuring result robustness.

Weaknesses

KG path mining is restricted to 2 hops, potentially missing complex relational knowledge required for reasoning-intensive questions. Lack of fine-grained hallucination localization limits the utility for debugging LLM behavior. Reliance on closed-source GPT-4o Mini for path quality evaluation raises concerns about reproducibility. No multi-prompt evaluation, despite the known sensitivity of LLM performance to prompt formatting. Semantic similarity metrics underestimate performance, as model respo

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The resource aims to bridge an important gap: the lack of good multilingual data sources for evaluating LLM hallucination.
2. The resource has been analysed with many evaluation results, including baseline experimentation.

Weaknesses

1. An LLM is used to assess the quality of the extracted knowledge graph paths by analysing the correlation with semantic scores between the predicted and gold answers for each question. Sometimes, when a path is incomplete or only partially correct, the LLM can still give the ground truth answer. I'm afraid this method cannot ensure a fair evaluation of the quality of the paths themselves.
2. The knowledge graphs are incomplete. The potential impacts of this factor are not considered in const

Code & Models

Repositories

ernlavr/multihal
Official

Datasets

ernlavr/multihal
dataset · 37 downloads

Taxonomy

Topics: Biomedical Text Mining and Ontologies