# Addressing accuracy and hallucination of LLMs in Alzheimer's disease research through knowledge graphs

**Authors:** Tingxuan Xu, Jiarui Feng, Justin Melendez, Kaleigh Roberts, Donghong Cai, Mingfang Zhu, Donald Elbert, Yixin Chen, Randall J. Bateman

arXiv: 2508.21238 · 2025-09-01

## TL;DR

This study tests whether two popular GraphRAG systems improve the quality and traceability of GPT-4o answers in Alzheimer's disease research. It builds a knowledge base from 50 papers and compares GraphRAG responses with those of standard GPT-4o on 70 expert questions.

## Contribution

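The paper constructs an Alzheimer's disease knowledge base for GraphRAG from 50 papers, compares GraphRAG response quality against standard GPT-4o on 70 expert questions, evaluates the traceability of several RAG and GraphRAG systems, and provides an interface with a pre-built database for testing both approaches.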

## Key findings

- GraphRAG improves response accuracy over standard GPT-4o.
- Enhanced traceability in GraphRAG aids scientific research.
- The provided interface, with its pre-built Alzheimer's disease database, lets researchers test both standard RAG and GraphRAG.

## Abstract

In the past two years, large language model (LLM)-based chatbots, such as ChatGPT, have revolutionized various domains by enabling diverse task completion and question-answering capabilities. However, their application in scientific research remains constrained by challenges such as hallucinations, limited domain-specific knowledge, and lack of explainability or traceability for the response. Graph-based Retrieval-Augmented Generation (GraphRAG) has emerged as a promising approach to improving chatbot reliability by integrating domain-specific contextual information before response generation, addressing some limitations of standard LLMs. Despite its potential, there are only limited studies that evaluate GraphRAG on specific domains that require intensive knowledge, like Alzheimer's disease or other biomedical domains. In this paper, we assess the quality and traceability of two popular GraphRAG systems. We compile a database of 50 papers and 70 expert questions related to Alzheimer's disease, construct a GraphRAG knowledge base, and employ GPT-4o as the LLM for answering queries. We then compare the quality of responses generated by GraphRAG with those from a standard GPT-4o model. Additionally, we discuss and evaluate the traceability of several Retrieval-Augmented Generation (RAG) and GraphRAG systems. Finally, we provide an easy-to-use interface with a pre-built Alzheimer's disease database for researchers to test the performance of both standard RAG and GraphRAG.
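The retrieve-then-generate idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline or the API of either GraphRAG system they evaluate. It assumes a toy knowledge graph of (subject, relation, object) triples, each tagged with a hypothetical paper identifier, and uses naive substring matching in place of real graph retrieval. The names `Triple`, `KNOWLEDGE_GRAPH`, `retrieve`, and `build_prompt` are all invented for illustration, and no LLM is called: the sketch stops at the prompt that would be sent to a model such as GPT-4o.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One edge of a toy knowledge graph, tagged with its source document."""
    subject: str
    relation: str
    obj: str
    source: str  # hypothetical paper identifier, not a real citation


# Toy graph: widely known facts with placeholder source IDs (assumption).
KNOWLEDGE_GRAPH = [
    Triple("APOE4", "increases risk of", "Alzheimer's disease", "paper_01"),
    Triple("amyloid-beta", "aggregates into", "plaques", "paper_02"),
    Triple("tau", "forms", "neurofibrillary tangles", "paper_03"),
]


def retrieve(question, graph):
    """Return triples whose subject or object appears in the question.

    Naive case-insensitive substring matching stands in for real
    entity linking and graph traversal.
    """
    q = question.lower()
    return [t for t in graph if t.subject.lower() in q or t.obj.lower() in q]


def build_prompt(question, triples):
    """Assemble a prompt in which each fact is numbered and carries its source,
    so that an answer can cite facts and be traced back to documents."""
    lines = [
        f"[{i}] {t.subject} {t.relation} {t.obj} (source: {t.source})"
        for i, t in enumerate(triples, start=1)
    ]
    context = "\n".join(lines) if lines else "(no matching facts)"
    return (
        "Answer using only the numbered facts and cite them.\n\n"
        f"Facts:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    question = "How does APOE4 relate to tau?"
    print(build_prompt(question, retrieve(question, KNOWLEDGE_GRAPH)))
```

Numbering each retrieved fact and keeping its source attached is one simple way to get the kind of traceability the abstract discusses: a reader can follow a cited fact back to the document it came from.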

---
Source: https://tomesphere.com/paper/2508.21238