Agreement Between AI and Nephrologists in Addressing Common Patient Questions About Diabetic Nephropathy: Cross-Sectional Study

Niloufar Ebrahimi; Mehrbod Vakhshoori; Seigmund Teichman; Amir Abdipour

PMC · DOI:10.2196/65846 · May 2, 2025

Open Access

TL;DR

This study compares how well AI and kidney specialists answer common questions about diabetic kidney disease.

Contribution

The study evaluates agreement between AI models and nephrologists in answering patient questions about diabetic nephropathy.

Findings

1. AI models showed varying levels of agreement with nephrologists in answering patient questions.

2. The study highlights the potential of AI in supporting patient education about diabetic nephropathy.

Abstract

This research letter presents a cross-sectional analysis comparing the agreement between artificial intelligence models and nephrologists in responding to common patient questions about diabetic nephropathy.

Linked entities

Species

Homo sapiens (human · species)

Diseases

diabetic nephropathy

Tables

Table 1. Distribution of answers according to each respondent.

Questions	Accuracy of answers
	ChatGPT-4, first round	ChatGPT-4, second round	Google Gemini	Nephrologist 1	Nephrologist 2
1. What is the gold standard for diagnosis of diabetic nephropathy?	Completely accurate	Completely accurate	Completely accurate	Completely accurate	Completely accurate
2. What is the current standard medication therapy for diabetic nephropathy?	Completely accurate	Completely accurate	Completely accurate	Completely accurate	Completely accurate
3. Can diabetic nephropathy be prevented?	Completely accurate	Relatively accurate	Completely accurate	Relatively accurate	Relatively accurate
4. Can tobacco use accelerate the progression of diabetic nephropathy?	Completely accurate	Relatively accurate	Completely accurate	Completely accurate	Completely accurate
5. How is the severity of diabetic nephropathy determined?	Completely accurate	Completely accurate	Relatively accurate	Relatively accurate	Completely accurate
6. How frequently should a patient be screened for diabetic nephropathy?	Relatively accurate	Completely accurate	Completely accurate	Relatively accurate	Relatively accurate
7. What are the risk factors for the development of diabetic nephropathy?	Completely accurate	Completely accurate	Completely accurate	Relatively accurate	Relatively accurate
8. What is the incidence of kidney failure in diabetic nephropathy?	Completely accurate	Relatively accurate	Completely accurate	Relatively accurate	Relatively accurate
9. When should dialysis begin in diabetic nephropathy?	Relatively accurate	Relatively accurate	Relatively accurate	Relatively accurate	Completely accurate
10. What is the most common cause of death in diabetic nephropathy?	Relatively accurate	Completely accurate	Relatively accurate	Completely accurate	Completely accurate

Table 2. Interrater reliability indices across different respondents.

Respondents	ChatGPT-4, first round	ChatGPT-4, second round	Google Gemini	Nephrologist 1	Nephrologist 2
ChatGPT-4, first round
κ	—	−0.08	0.52	0.07	−0.08
P value	—	.78	.10	.78	.78
ChatGPT-4, second round
κ	−0.08	—	−0.08	0.23	0.16
P value	.78	—	.78	.43	.60
Google Gemini
κ	0.52	−0.08	—	0.07	−0.52
P value	.10	.78	—	.78	.09
Nephrologist 1
κ	0.07	0.23	0.07	—	0.61
P value	.78	.43	.78	—	.04
Nephrologist 2
κ	−0.08	0.16	−0.52	0.61	—
P value	.78	.60	.09	.04	—

Keywords

artificial intelligence; diabetic nephropathy; nephrologist; ChatGPT; Google Gemini

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Artificial Intelligence in Healthcare · Machine Learning in Healthcare

Full text

Introduction

Diabetic nephropathy (DN) is one of the most frequent and severe complications of diabetes, requiring early detection and management [1]. Patients with diabetes should receive accurate information from health care professionals on preventing kidney disease. However, many turn to artificial intelligence (AI) models, like ChatGPT and Google Gemini, for web-based medical information [2-4]. To evaluate the capabilities of ChatGPT-4 and Google Gemini versus nephrologists in providing accurate DN information, their performance in answering the DN-related questions most commonly raised by patients was assessed.

Methods

Collection of Questions

To generate patient-focused questions, the following query was prompted to AI models: “What are the most frequently asked questions by individuals regarding diabetic nephropathy?”

The AI-generated responses were systematically reviewed. The final question set was refined and adjusted based on the principal investigator’s experience in clinical practice, ensuring alignment with common patient concerns encountered in real-world practice.

Ultimately, 10 questions covering various DN aspects were developed. Questions 1, 3, and 7 were used to evaluate DN’s diagnosis, prevention, and risk factors, respectively.

Questions 2, 6, and 9 were used to evaluate DN management. Questions 8 and 10 were included to assess DN complications. To evaluate DN progression and severity, questions 4 and 5 were selected.

Collecting Chatbot and Nephrologist Responses

To ensure consistency, a single investigator entered all questions into ChatGPT-4 and Google Gemini between May 23 and July 7, 2024. Each question was entered into ChatGPT-4 twice, initially and again after 45 days, to assess changes in accuracy over time. Google Gemini was queried once, concurrently with the second ChatGPT-4 round, and was limited to short-response tasks. Two faculty nephrologists from Loma Linda University with clinical and academic experience also completed the questionnaire via a Google Forms survey.

Evaluation of Chatbot and Nephrologist Responses

An independent reviewer—a professor of medicine from the same academic center—evaluated AI and nephrologists’ responses. Each answer was graded as “completely inaccurate,” “relatively inaccurate,” “irrelevant,” “relatively accurate,” or “completely accurate.” To prevent grading bias, the reviewer was not informed about the nephrologists’ identities.

Statistical Analysis

Analyses were conducted using RStudio (version 4.3.0; RStudio Inc), with P values of <.05 considered significant. Agreement between respondents is reported as the κ statistic (Table 2).
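
The paper does not spell out how the pairwise κ values in Table 2 were computed. The following is a minimal Python sketch, assuming they are unweighted Cohen's κ calculated on the two-level ratings in Table 1; it is not the authors' R code. The names `RATINGS` and `cohen_kappa` are illustrative, and the rating strings are transcribed by hand from Table 1.

```python
from collections import Counter
from itertools import combinations

# Ratings transcribed from Table 1, questions 1-10 in order.
# "C" = completely accurate, "R" = relatively accurate.
RATINGS = {
    "ChatGPT-4, first round":  "CCCCCRCCRR",
    "ChatGPT-4, second round": "CCRRCCCRRC",
    "Google Gemini":           "CCCCRCCCRR",
    "Nephrologist 1":          "CCRCRRRRRC",
    "Nephrologist 2":          "CCRCCRRRCC",
}


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length rating sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("rating sequences must be non-empty and equal in length")
    n = len(a)
    # Observed agreement: share of questions with identical ratings.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the marginal proportions, summed over categories.
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / n**2
    if expected == 1.0:
        # Both sequences use one identical category; kappa is undefined,
        # and this sketch returns 1.0 by convention (an assumption).
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    for (name_a, a), (name_b, b) in combinations(RATINGS.items(), 2):
        print(f"{name_a} vs {name_b}: kappa = {cohen_kappa(a, b):.3f}")
```

Under these assumptions the sketch gives κ of about 0.615 for the two nephrologists and about 0.524 for ChatGPT-4 (first round) versus Google Gemini, in line with the 0.61 and 0.52 in Table 2. The remaining pairs agree with the table to within about 0.01, which would be consistent with the published values being truncated rather than rounded. The P values are not reproduced here.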

Ethical Considerations

As no patient data were involved, ethical approval was not required. This study adhered to ethical principles for research integrity and transparency.

Results

Table 1 presents the accuracy distribution of responses for each question, as assessed by the reviewer. No responses were categorized as irrelevant or inaccurate; all were rated as relatively or completely accurate.

Table 2 summarizes the interrater reliability indices among different respondents. The two nephrologists showed statistically significant agreement (κ=0.61; P=.04). ChatGPT-4 and Google Gemini had moderate but nonsignificant agreement (κ=0.52; P=.10). No significant agreement was found between either AI model and either nephrologist (all P values were >.05). ChatGPT-4 responses lacked consistency over time (κ=−0.08; P=.78). Further analysis showed negligible, nonsignificant agreement among all respondents (κ=0.083; P=.41). Excluding ChatGPT-4’s second-round responses did not alter the results (κ=0.09; P=.45), confirming the lack of significant agreement.
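
The text does not name the statistic behind the agreement "among all respondents". One plausible reading, offered here as an assumption rather than the authors' stated method, is Fleiss' κ across all five sets of ratings in Table 1. The sketch below uses hypothetical names (`RATINGS`, `fleiss_kappa`) and rating strings transcribed by hand from Table 1.

```python
from collections import Counter

# Ratings transcribed from Table 1, questions 1-10 in order.
# "C" = completely accurate, "R" = relatively accurate.
RATINGS = {
    "ChatGPT-4, first round":  "CCCCCRCCRR",
    "ChatGPT-4, second round": "CCRRCCCRRC",
    "Google Gemini":           "CCCCRCCCRR",
    "Nephrologist 1":          "CCRCRRRRRC",
    "Nephrologist 2":          "CCRCCRRRCC",
}


def fleiss_kappa(ratings_by_rater):
    """Fleiss' kappa for several raters who each rated every item once."""
    raters = list(ratings_by_rater.values())
    n_raters = len(raters)
    n_items = len(raters[0])
    if n_raters < 2 or any(len(r) != n_items for r in raters):
        raise ValueError("need at least two raters with equal-length ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for i in range(n_items):
        # How many raters chose each category for item i.
        counts = Counter(r[i] for r in raters)
        category_totals.update(counts)
        # Share of rater pairs that agree on item i.
        agreement_sum += sum(c * (c - 1) for c in counts.values()) / (
            n_raters * (n_raters - 1)
        )

    observed = agreement_sum / n_items
    total = n_items * n_raters
    # Chance agreement from the pooled category proportions.
    expected = sum((t / total) ** 2 for t in category_totals.values())
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    print(f"All five respondents: {fleiss_kappa(RATINGS):.3f}")
    without_second = {
        k: v for k, v in RATINGS.items() if k != "ChatGPT-4, second round"
    }
    print(f"Excluding ChatGPT-4 second round: {fleiss_kappa(without_second):.3f}")
```

Under this assumption the sketch returns about 0.083 for all five respondents, matching the reported value, and about 0.097 once the second ChatGPT-4 round is dropped, against the reported 0.09. The match supports the Fleiss reading but does not confirm it, and the P values are not reproduced.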

Discussion

We found that AI models generally provided accurate responses to DN-related questions, with moderate agreement on accuracy between the two nephrologists. However, agreement between AI outputs and nephrologists’ assessments was minimal, indicating a lack of standardized evaluation or clinical alignment. The moderate concordance between ChatGPT-4 and Google Gemini suggests similar underlying approaches, and the improved agreement in ChatGPT-4’s second round indicates potential learning and adaptability. Even so, the models’ limited alignment with nephrologists raises concerns regarding their clinical applicability. Interactive AI may still enhance clinical processes by supporting patient education and facilitating communication between patients and clinicians on typical disease prevention–related queries [6], although the more questions lean toward subspecialties, the less accurate AI responses tend to be [7].

Although AI models can offer helpful responses about DN, they are not substitutes for thorough clinical discussions, due to observed inconsistencies. Given this study’s preliminary nature, findings should be interpreted cautiously. Further research with larger datasets is warranted to evaluate AI’s reliability in clinical use.

This study has several limitations. The AI models used were not specifically designed for medical applications, and the free versions, which we intentionally selected to reflect typical patient use, may underperform when compared to premium versions. Moreover, including only 2 nephrologists limits the diversity of clinical perspectives, and evaluations by a single senior nephrologist may introduce bias; future studies should include multiple reviewers to strengthen evaluation reliability and validity. Lastly, we did not assess AI responses’ clarity or helpfulness from the patient perspective, highlighting the need for user-centered evaluations in future research.

Bibliography

1. Samsu N. Diabetic nephropathy: challenges in pathogenesis, diagnosis, and treatment. Biomed Res Int. Jul 8, 2021;2021:1497449. doi: 10.1155/2021/1497449. Medline: 34307650. PMC: 8285185.
2. Miao J, Thongprayoon C, Cheungpasitporn W. Assessing the accuracy of ChatGPT on core questions in glomerular disease. Kidney Int Rep. May 26, 2023;8(8):1657-1659. doi: 10.1016/j.ekir.2023.05.014. Medline: 37547515. PMC: 10403654.
3. ChatGPT - release notes. OpenAI. URL: https://help.openai.com/en/articles/6825453-chatgpt-release-notes. Accessed 28-04-2025.
4. Gemini Apps’ release updates & improvements. Gemini Advanced. URL: https://gemini.google.com/updates. Accessed 30-04-2025.
5. McHugh ML. Interrater reliability: the kappa statistic. Biochem Med (Zagreb). Oct 15, 2012;22(3):276-282. doi: 10.11613/BM.2012.031. Medline: 23092060. PMC: 3900052.
6. Sarraju A, Bruemmer D, Van Iterson E, Cho L, Rodriguez F, Laffin L. Appropriateness of cardiovascular disease prevention recommendations obtained from a popular online chat-based artificial intelligence model. JAMA. Mar 14, 2023;329(10):842-844. doi: 10.1001/jama.2023.1044. Medline: 36735264. PMC: 10015303.
7. Caranfa JT, Bommakanti NK, Young BK, Zhao PY. Accuracy of vitreoretinal disease information from an artificial intelligence chatbot. JAMA Ophthalmol. Sep 1, 2023;141(9):906-907. doi: 10.1001/jamaophthalmol.2023.3314. Medline: 37535363. PMC: 10401388.
8. Certification - ABAIM. The American Board of Artificial Intelligence in Medicine. URL: https://abaim.org/certification. Accessed 28-04-2025.