Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study Over Open-ended Question Answering
Yuan Sui, Yufei He, Zifeng Ding, Bryan Hooi

TL;DR
This paper introduces OKGQA, a new benchmark for evaluating how effectively Knowledge Graphs can improve the trustworthiness and reasoning abilities of Large Language Models in open-ended question answering scenarios, including scenarios with noisy KGs.
Contribution
The paper presents OKGQA, a novel benchmark for assessing LLMs augmented with KGs in real-world, open-ended tasks, and evaluates the impact of KG errors on model performance.
Findings
KGs can enhance LLM reasoning in open-ended tasks.
Performance degrades with noisy or contaminated KGs.
The benchmark enables comprehensive evaluation of trustworthiness improvements.
Abstract
Recent works integrating Knowledge Graphs (KGs) have shown promising improvements in enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing benchmarks primarily focus on closed-ended tasks, leaving a gap in evaluating performance on more complex, real-world scenarios. This limitation also hinders a thorough assessment of KGs' potential to reduce hallucinations in LLMs. To address this, we introduce OKGQA, a new benchmark specifically designed to evaluate LLMs augmented with KGs in open-ended, real-world question answering settings. OKGQA reflects practical complexities through diverse question types and incorporates metrics to quantify both hallucination rates and reasoning improvements in LLM+KG models. To consider the scenarios in which KGs may contain varying levels of errors, we propose a benchmark variant, OKGQA-P, to assess model performance when…
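As a rough illustration of what evaluating against a KG with "varying levels of errors" could involve, the sketch below corrupts a fraction of the triples in a toy graph by swapping each selected edge's relation for a different one. The summary above does not describe how OKGQA-P is actually constructed, so the function name `perturb_triples`, the `rate` parameter, and the relation-swapping strategy are all assumptions made for illustration, not the authors' procedure.

```python
import random
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def perturb_triples(triples: List[Triple], rate: float, seed: int = 0) -> List[Triple]:
    """Return a copy of `triples` with about `rate` of the edges corrupted.

    Hypothetical strategy (not taken from the paper): each selected edge keeps
    its head and tail but gets a different relation drawn from the relations
    already present in the graph.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    rng = random.Random(seed)
    relations = sorted({r for _, r, _ in triples})
    n_perturb = round(rate * len(triples))
    chosen = set(rng.sample(range(len(triples)), n_perturb))

    noisy = []
    for i, (head, rel, tail) in enumerate(triples):
        # Only swap when an alternative relation exists in the graph.
        if i in chosen and len(relations) > 1:
            rel = rng.choice([r for r in relations if r != rel])
        noisy.append((head, rel, tail))
    return noisy


if __name__ == "__main__":
    kg = [
        ("Marie Curie", "born_in", "Warsaw"),
        ("Marie Curie", "field", "Physics"),
        ("Warsaw", "capital_of", "Poland"),
        ("Pierre Curie", "spouse_of", "Marie Curie"),
    ]
    for triple in perturb_triples(kg, rate=0.5, seed=1):
        print(triple)
```

Sweeping `rate` from 0 to 1 and re-running the same retrieval and generation pipeline at each level is one way to trace how answer quality changes as the graph gets noisier.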
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The introduction of OKGQA as a novel benchmark addresses a current research gap in assessing LLMs in open-ended, real-world scenarios. 2. The perturbed benchmark, OKGQA-P, allows LLM robustness to be evaluated when KGs contain inaccuracies or noise. 3. The comprehensive and well-organized experimental results across different forms of information and different types of queries show the effectiveness of the proposed methods and can provide valuable insights for future researche
1. The proposed methodology of G-retrieval and G-generator is similar to the existing line of work on RAG and KG-augmented generation [1]. However, no comparison is given to show how the proposed methods fundamentally differ from previous methods applied to closed-ended QA. 2. The queries are generated by LLMs from predefined templates, which raises concerns about whether they authentically represent the distribution and complexity of real-world questions. 3. The proposed OK
1. The effectiveness of different retrieval methods in conjunction with LLMs is analyzed in experiments that provide insights into the combination of KGs and LLMs.
1. The unique contributions of the OKGQA benchmark are insufficiently defined, and the distinctions between OKGQA and existing benchmarks are not clearly articulated. 2. While the paper references several closed-ended elements from related literature, it lacks a thorough discussion of limitations and practical implications. 3. The question of whether KGs can reduce hallucinations in LLMs is widely recognized as affirmative, given the effectiveness of RAG techniques in mitigating hallucination is
This paper addresses a significant research question: whether knowledge graphs (KGs) can make large language models (LLMs) more reliable in open-ended question answering. It designs a new benchmark, OKGQA, specifically for assessing LLMs enhanced with KGs in open-ended, real-world question answering scenarios. By proposing the OKGQA-P experimental setup, the paper considers scenarios where KGs may have varying levels of errors, further simulating real-world situations where KGs' qua
While OKGQA-P considers errors in KGs, further exploration of how well these findings generalize to a broader range of real-world applications may be necessary. Although the authors present experimental results, they could provide a more in-depth analysis and discussion of why certain methods outperform others and of the potential limitations of these methods.
1. The authors elaborate on the necessity of evaluating LLMs under open-ended QA scenarios in detail. 2. The OKGQA-P setting to assess KG-augmented LLMs when KGs are contaminated is intuitive and reasonable. 3. The perturbation results in Figures 4 and 5 are very interesting.
1. Observation (2) in line 76, that CoT and SC may cause bias and hallucination, is confusing. The results (Llama3.1 8B) in Table 1 show that integrating CoT and SC helps improve quality and factuality. 2. The authors may want to provide more statistical details on the human-in-the-loop process in line 186, including the educational background of the people involved and the agreement between the final score distributions of the people and the LLMs. 3. The authors may need to do more kinds of perturb
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Semantic Web and Ontologies
Methods: Focus
