FinReflectKG -- HalluBench: GraphRAG Hallucination Benchmark for Financial Question Answering Systems
Mahesh Kumar, Bhaskarjit Sarmah, Stefano Pasquali

TL;DR
This paper introduces FinReflectKG -- HalluBench, a benchmark dataset for evaluating hallucination detection methods in knowledge-graph-augmented financial question-answering systems, and uses it to show where current detection approaches are vulnerable and where they remain robust.
Contribution
The paper presents a new benchmark dataset and evaluation framework for hallucination detection in KG-augmented financial QA, including analysis of multiple detection methods under noisy conditions.
Findings
LLM judges and embedding methods perform best under clean conditions.
Detection methods significantly degrade with noisy KG triplets.
Embedding methods show greater robustness to noise.
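The clean-versus-noisy comparison in these findings can be illustrated with a minimal sketch. It assumes a binary labeling (1 = hallucinated, 0 = grounded) and uses F1 with a relative drop as the degradation measure. The toy labels and predictions, and the helper names `f1_score` and `relative_drop`, are hypothetical and are not taken from the paper or its dataset.

```python
def f1_score(labels, preds):
    """F1 for the positive class (assumed here: 1 = hallucinated)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relative_drop(clean_score, noisy_score):
    """Fraction of the clean score lost when noise is introduced."""
    if clean_score == 0:
        return 0.0
    return (clean_score - noisy_score) / clean_score


# Hypothetical toy data: gold labels and one detector's predictions
# with clean KG triplets and with noisy KG triplets.
labels      = [1, 0, 1, 1, 0, 0, 1, 0]
preds_clean = [1, 0, 1, 1, 0, 1, 1, 0]
preds_noisy = [1, 1, 0, 1, 0, 1, 0, 0]

f1_clean = f1_score(labels, preds_clean)
f1_noisy = f1_score(labels, preds_noisy)
drop = relative_drop(f1_clean, f1_noisy)

print(f"F1 clean: {f1_clean:.3f}")
print(f"F1 noisy: {f1_noisy:.3f}")
print(f"Relative drop: {drop:.1%}")
```

A smaller relative drop for one method than for another is what "greater robustness to noise" means in this sketch; the paper may report different metrics.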
Abstract
As organizations increasingly integrate AI-powered question-answering systems into financial information systems for compliance, risk assessment, and decision support, ensuring the factual accuracy of AI-generated outputs becomes a critical engineering challenge. Current Knowledge Graph (KG)-augmented QA systems lack systematic mechanisms to detect hallucinations - factually incorrect outputs that undermine reliability and user trust. We introduce FinBench-QA-Hallucination, a benchmark for evaluating hallucination detection methods in KG-augmented financial QA over SEC 10-K filings. The dataset contains 755 annotated examples from 300 pages, each labeled for groundedness using a conservative evidence-linkage protocol requiring support from both textual chunks and extracted relational triplets. We evaluate six detection approaches - LLM judges, fine-tuned classifiers, Natural Language…
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Misinformation and Its Impacts
