Patient Perceptions of Artificial Intelligence-Generated Kidney Transplant Information: Comparing ChatGPT With the National Kidney Foundation
Hwarang Stephen Han, Jihye Lee

TL;DR
Patients with chronic kidney disease preferred AI-generated information about kidney transplants over traditional sources, suggesting AI could enhance patient education when used with professional guidance.
Contribution
This study is the first to compare patient perceptions of AI-generated transplant information with that from a trusted health organization.
Findings
Participants preferred ChatGPT responses over National Kidney Foundation responses in 81.3% of comparisons.
ChatGPT was rated higher in information quality, empathy, and learning outcomes.
Findings suggest AI can present transplant information in patient-friendly ways.
Abstract
Generative artificial intelligence (AI) may help patients better understand the complexities of kidney transplantation. However, little is known about how individuals with chronic kidney disease (CKD) perceive AI-generated health information. This study assessed patient perceptions of AI-generated responses to common kidney transplant queries compared to those from a trusted health resource. The study used a cross-sectional online survey. A total of 216 adults with CKD, including kidney transplant recipients, residing in the United States participated in the study. Participants compared kidney transplant-related query responses generated by ChatGPT (GPT-4o), a widely used generative AI tool, with those provided by the National Kidney Foundation (NKF). Participant perceptions were assessed across several domains: overall preference, perceived information quality, empathy, and learning outcomes. Participants…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
Generative artificial intelligence (AI) is rapidly transforming how people access information, make decisions, and navigate daily life. AI-powered chatbots, built on large language models (LLMs) trained on large-scale text datasets and related structured data from books, websites, and other sources, can generate human-like responses to a wide range of questions and prompts.1 Although not specifically developed for health care, their ability to provide medical guidance is making generative AI tools increasingly valuable in clinical settings.2-7 For example, studies have found that Chat Generative Pre-trained Transformer (ChatGPT), one of the most popular generative AI services, can pass the United States Medical Licensing Examination (USMLE) and can supply high-quality health information with a notable degree of empathy.8,9
The field of kidney transplantation presents a valuable opportunity for investigating the applications of generative AI in health care. Kidney transplantation, a preferred treatment for patients receiving kidney replacement therapy, offers better long-term outcomes than dialysis.10 However, the transplant journey can be overwhelming for many patients, often presenting significant challenges.11,12 Patients tend to face complex medical decisions, undergo comprehensive pretransplant evaluations for candidacy, and continue posttransplant monitoring for potential complications for the life of the allograft.13 Kidney transplant centers and their multidisciplinary teams have been the primary sources of guidance, yet many patients and their support networks encounter barriers including limited time, emotional stress, geographic distance, and resource constraints.14,15 ChatGPT and other generative AI tools have the potential to complement traditional clinical care in the field of kidney transplantation.16-20 By delivering easily accessible, around-the-clock support and education, generative AI has the potential to enhance patient knowledge, involvement, and sense of control throughout the kidney transplant process.
Despite increasing interest in the potential of generative AI in kidney transplantation, little is known about how patients with chronic kidney disease (CKD) perceive and interpret its use in the context of kidney transplantation. This study addresses a critical gap by examining how patients with CKD perceive information provided by ChatGPT compared with that from the National Kidney Foundation (NKF), a widely trusted authority in kidney health education.21 From NKF’s website, we identified common kidney transplant-related queries and their official responses. Using the same queries as prompts, we generated corresponding responses from ChatGPT and asked participants to compare them with responses from NKF without knowing the source of each response. This comparative approach offers valuable insights into how generative AI is perceived in terms of quality and usefulness compared to traditional web-based information. It also reveals patient preferences, highlighting both the strengths and limitations of generative AI in kidney transplantation education, while identifying opportunities to enhance the information provided by established expert sources.
Materials and Methods
For this study, NKF’s kidney transplant educational webpage was used to develop prompts for ChatGPT and to serve as a benchmark for comparison.22 Four prompts were selected from NKF’s website, reflecting common queries related to kidney transplantation: ‘About kidney transplant,’ ‘Benefits of kidney transplant,’ ‘Risks of kidney transplant,’ and ‘Who can get a kidney transplant.’ We first gathered NKF’s responses to the prompts. Using the same queries, we then generated corresponding responses from ChatGPT (GPT-4o, accessed on May 22, 2025). To eliminate any potential bias from prior user interactions, we used an alias user account and cleared the chat history before entering each prompt into ChatGPT. To emulate a typical user experience, we did not manipulate ChatGPT’s parameters, including response length, temperature (which controls the randomness of the model’s output), or top_p sampling (which limits choices to the most probable next words), as changes to these parameters could influence the content, tone, and variability of its responses.23 Four physicians with expertise in kidney transplantation reviewed ChatGPT’s responses to assess their accuracy and clinical appropriateness. The complete responses from both NKF and ChatGPT are provided in the Supporting Information (Table S1).
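To make the two sampling parameters mentioned above concrete, the following is a minimal Python sketch of how temperature and top_p are commonly applied when a language model picks its next word. It is an illustration only: the function name `sample_next_token`, the toy vocabulary, and the scores are hypothetical, and nothing here describes how ChatGPT is actually implemented or configured.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=1.0):
    """Sample one token from a {token: score} map (hypothetical helper)."""
    # Temperature rescales the scores: values below 1 sharpen the
    # distribution, values above 1 flatten it and increase randomness.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    max_score = max(scaled.values())
    weights = {tok: math.exp(s - max_score) for tok, s in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}

    # top_p keeps the smallest set of most probable tokens whose cumulative
    # probability reaches top_p; everything else is excluded.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    tokens = [tok for tok, _ in kept]
    kept_probs = [p for _, p in kept]
    # random.choices normalizes the weights of the kept tokens internally.
    return random.choices(tokens, weights=kept_probs, k=1)[0]

# Toy scores for four candidate next words (made-up values).
toy_logits = {"kidney": 2.0, "transplant": 1.5, "dialysis": 0.5, "banana": -3.0}
print(sample_next_token(toy_logits, temperature=0.7, top_p=0.9))
```

Because both settings change which words can be chosen and how often, leaving them at their defaults keeps the generated responses close to what an ordinary user would see.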
Participants were recruited through Prolific, an online research panel provider known for delivering high-quality data in terms of participant attention, comprehension, and reliability.24 Participants who consented to the study completed a brief survey on demographic information (eg, age, gender, education, and household income) and assessed their attitudes toward AI (eg, ‘AI will make the world a better place,’ ‘I have strong negative emotions about AI’).25,26 Of the 237 participants initially recruited, 21 were excluded for failing an attention check intended to confirm they had reviewed both the prompts and responses. This resulted in a final analytic sample of 216 United States adults aged 18 to 78 (M = 39.41, SD = 13.55) with CKD, including individuals with a history of kidney transplantation. As shown in Table 1, nearly half (47.69%) reported having kidney disease but not being on dialysis, while 24.5% were currently treated with dialysis and 27.8% had received a kidney transplant. The sample was 53.7% female, 44% male, and 2.31% non-binary. The most represented age groups were 18-29 (27.8%) and 30-39 (28.2%). The sample overrepresented individuals with higher household incomes, with 44.4% reporting annual earnings of $100,000 or more.

Table 1.

| Characteristic | n (%) |
|---|---|
| Household income | |
| …50,000 | 43 (19.9%) |
| …99,999 | 77 (35.6%) |
| $100,000 or more | 96 (44.4%) |
| Race/ethnicity | |
| American Indian or Alaska Native | 2 (0.9%) |
| Asian or Asian-American | 5 (2.3%) |
| Black or African-American | 60 (27.8%) |
| Hispanic or Latino or Spanish Origin | 5 (2.3%) |
| Some other race | 4 (1.9%) |
| White | 140 (64.8%) |
| Number of observations | 216 (100%) |
Each participant was shown transcripts containing prompts paired with responses from ChatGPT or original content from NKF, presented in random order without revealing the source. NKF and ChatGPT responses were randomized per participant and per prompt to minimize potential ordering effects. To ensure data quality, participants were required to spend at least 30 seconds reading each transcript. For each transcript, participants evaluated responses across 3 dimensions: information quality, empathy, and perceived learning outcomes. Information quality (eg, “How would you rate the quality of the information provided?”) and empathy (eg, “How empathetic do you find the information provided?”) were assessed using 5-point Likert scales (1 = very low and 5 = very high), with higher values indicating greater information quality or empathy.8 Perceived learning outcomes were measured by participants’ agreement with two statements:28 “The response I just read helped me better understand kidney transplantation,” and “After reading the response, I felt more confident in my understanding of kidney transplantation.” These were also rated on 5-point Likert scales (1 = strongly disagree and 5 = strongly agree). Internal consistency for the two learning-outcome items was assessed using Pearson correlations for each response, which indicate how closely the two items are related to each other. Correlations ranged from 0.69 to 0.82 across NKF and ChatGPT responses, indicating acceptable reliability for a composite score. The 2 items were therefore averaged to create a single composite score for perceived learning outcomes, with higher values indicating greater perceived learning. Finally, after reading 2 responses to the same prompt, participants were shown both responses side by side (example shown in Figure 1) and asked to indicate which one they preferred (“Which response was better?”).8 Each participant reviewed a total of 8 transcripts, corresponding to 4 prompt-response pairs.

Figure 1. Sample screenshot showing a transcript of a kidney transplant query. Response 1 represents the answer provided by the National Kidney Foundation (NKF), while Response 2 shows the corresponding response generated by ChatGPT.
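The reliability check and composite score for the two learning-outcome items can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the six ratings below are hypothetical and are not the study data.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of ratings."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / (spread_x * spread_y) ** 0.5

def composite_scores(item1, item2):
    """Average the two 5-point items into one score per participant."""
    return [(a + b) / 2 for a, b in zip(item1, item2)]

# Hypothetical ratings from six participants for one response.
understanding = [5, 4, 3, 5, 2, 4]  # "helped me better understand..."
confidence = [4, 4, 3, 5, 1, 5]     # "felt more confident..."

print(round(pearson_r(understanding, confidence), 2))
print(composite_scores(understanding, confidence))
```

A correlation close to 1 means the two items move together, which is the usual justification for averaging them into a single score.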
We assessed the readability of ChatGPT’s responses compared with NKF materials, using the response length (measured in word count) and the Flesch–Kincaid grade-level metric. Next, we conducted χ² tests to examine whether participants’ response preferences differed between NKF and ChatGPT across the 4 prompts. We then explored how these preferences related to participants’ perceptions of information quality, empathy, and learning outcomes by comparing ratings between NKF and ChatGPT. Because each participant evaluated 8 transcripts (4 prompt-response pairs), each participant contributed multiple ratings, resulting in repeated measurements. To account for this structure, we employed mixed-effects models, which adjust for differences between individuals and prompts. The models also controlled for participants’ demographics (age, gender, race/ethnicity, education, and household income), general attitudes toward AI, and type of CKD. All statistical analyses were performed in R statistical software, version 4.1 (R Project for Statistical Computing). This study was approved by the institutional review board at the corresponding author’s university, and all data were collected in accordance with the approved research protocol.
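For readers unfamiliar with the Flesch–Kincaid grade level, the standard formula is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The Python sketch below applies it to a text string. The sentence splitter and the vowel-group syllable counter are crude assumptions made for illustration; the paper does not state which tool computed its scores, and dedicated readability software will give somewhat different values.

```python
import re

def count_syllables(word):
    """Rough heuristic: one syllable per group of consecutive vowels."""
    vowel_groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(vowel_groups))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level using the standard published formula."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

simple = "A kidney transplant is a surgery. It gives you a healthy kidney."
print(round(flesch_kincaid_grade(simple), 2))
```

Longer sentences and longer words both push the score up, so a higher grade level indicates text that is harder to read.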
Results
We began by assessing the accessibility and length of responses from NKF and ChatGPT. ChatGPT responses were longer (M = 219.0, SD = 18.62, min = 197, max = 241) compared with NKF materials (M = 164.8, SD = 120.87, min = 94, max = 345), but a Wilcoxon signed-rank exact test indicated that this difference was not statistically significant (V = 7, P = 0.63). The analysis of the Flesch–Kincaid grade-level metric, where higher scores indicate more difficult text, showed that ChatGPT responses had a higher average reading level (M = 17.12, SD = 6.83) than NKF responses (M = 9.85, SD = 1.30). However, the same test indicated that this difference was also not statistically significant (V = 9, P = 0.25).
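The two P values can be checked by hand, because an exact Wilcoxon signed-rank test on 4 pairs has only 16 possible sign patterns. The Python sketch below enumerates them. It assumes no tied or zero differences and a two-sided test; the function name is hypothetical.

```python
from itertools import product

def wilcoxon_exact_p(v_observed, n_pairs):
    """Two-sided exact P value for the signed-rank statistic V."""
    ranks = range(1, n_pairs + 1)
    # Under the null hypothesis, each rank is equally likely to belong to a
    # positive or a negative difference, so every sign pattern is equally likely.
    v_values = [
        sum(rank for rank, positive in zip(ranks, signs) if positive)
        for signs in product([False, True], repeat=n_pairs)
    ]
    total = len(v_values)
    upper_tail = sum(v >= v_observed for v in v_values) / total
    lower_tail = sum(v <= v_observed for v in v_values) / total
    return min(1.0, 2 * min(upper_tail, lower_tail))

print(wilcoxon_exact_p(7, 4))   # 0.625, reported as 0.63
print(wilcoxon_exact_p(9, 4))   # 0.25
print(wilcoxon_exact_p(10, 4))  # 0.125, the smallest value possible with 4 pairs
```

With only 4 prompt pairs, the smallest attainable two-sided P value is 0.125, so this test cannot reach the conventional 0.05 threshold however large the differences are.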
Across all 864 comparisons (216 participants × 4 prompts), ChatGPT’s response was preferred in 81.3% of cases (95% CI, 78.7%-83.9%). A χ² test confirmed that this overall preference for ChatGPT over NKF was statistically significant (χ² = 337.5; P < 0.001). Each prompt, along with participants’ preferences between NKF and ChatGPT responses, is presented in Table 2.

Table 2. Distribution of Participant Preferences Across 4 Kidney Transplantation Prompts

| Preferred source | Q1: About kidney transplant | Q2: Benefits of kidney transplant | Q3: Risks of kidney transplant | Q4: Who can get a kidney transplant | Total |
|---|---|---|---|---|---|
| NKF | 74 (34.3%) | 25 (11.6%) | 31 (14.4%) | 32 (14.8%) | 162 (18.8%) |
| ChatGPT | 142 (65.7%) | 191 (88.4%) | 185 (85.7%) | 184 (85.2%) | 702 (81.3%) |
| Total | 216 (100%) | 216 (100%) | 216 (100%) | 216 (100%) | 864 (100%) |
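The overall preference statistics can be reproduced from the Table 2 totals. The Python sketch below assumes a goodness-of-fit χ² test against an even 50/50 split and a normal-approximation (Wald) confidence interval. The paper does not state which interval method was used, so both are assumptions, and the function names are hypothetical.

```python
import math

def preference_chi_square(count_a, count_b):
    """Goodness-of-fit chi-square against an even split between two options."""
    expected = (count_a + count_b) / 2
    return sum((observed - expected) ** 2 / expected for observed in (count_a, count_b))

def wald_interval(successes, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    share = successes / n
    standard_error = math.sqrt(share * (1 - share) / n)
    return share - z * standard_error, share + z * standard_error

chatgpt_preferred, nkf_preferred = 702, 162  # totals from Table 2
n = chatgpt_preferred + nkf_preferred

print(preference_chi_square(chatgpt_preferred, nkf_preferred))  # 337.5
low, high = wald_interval(chatgpt_preferred, n)
print(f"{chatgpt_preferred / n:.1%} (95% CI, {low:.1%}-{high:.1%})")
```

The χ² value matches the reported 337.5 exactly. Under these assumptions the interval comes out within about 0.1 percentage point of the reported bounds, a gap that is consistent with rounding.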
Table 3 presents the full results of the mixed-effects analyses for patient perceptions of information quality, empathy, and learning outcomes by source. In all cases, models controlled for patient demographics (eg, education and income), personal experience with kidney disease, and attitudes toward AI. For information quality, patients with CKD perceived ChatGPT’s responses as significantly higher in quality than NKF’s responses (b = 0.63, SE = 0.03; P < 0.001). The mean rating for ChatGPT’s responses was 4.33 (SD = 0.80, 95% CI [4.28-4.39]), whereas NKF’s responses received a lower average rating of 3.7 (SD = 0.98, 95% CI [3.64-3.77]), reflecting a 14.6% lower score on average compared with ChatGPT.

Table 3. Effects of Source (ChatGPT vs National Kidney Foundation) on Perceived Information Quality, Empathy, and Learning Outcomes Among Patients with Chronic Kidney Disease: Mixed-Effect Analysis Results

| | Information Quality: B | SE | Empathy: B | SE | Learning Outcomes: B | SE |
|---|---|---|---|---|---|---|
| Fixed effects | | | | | | |
| Intercept | 2.92ᵃ | 0.46 | 2.59ᵃ | 0.57 | 3.00ᶜ | 0.44 |
| Source: ChatGPT (vs NKF) | 0.63ᵃ | 0.03 | 0.31ᵃ | 0.04 | 0.6ᵃ | 0.03 |
| Control variables | | | | | | |
| Age | 0.002 | 0.003 | −0.002 | 0.004 | −0.001 | 0.003 |
| Gender (Ref. Non-binary) | | | | | | |
| Female | 0.04 | 0.08 | 0.04 | 0.10 | 0.07 | 0.08 |
| Male | 0.25 | 0.26 | −0.07 | 0.33 | 0.15 | 0.26 |
| Race/Ethnicity (Ref. Other) | | | | | | |
| American Indian or Alaska Native | −0.17 | 0.49 | 0.53 | 0.62 | 0.09 | 0.47 |
| Asian | −0.9ᵇ | 0.39 | −0.47 | 0.49 | −0.55 | 0.38 |
| Black or African American | −0.43 | 0.3 | −0.09 | 0.37 | −0.45 | 0.29 |
| Hispanic, Latino, or Spanish Origin | −0.03 | 0.38 | 0.33 | 0.48 | −0.28 | 0.37 |
| White | −0.41 | 0.29 | 0.001 | 0.37 | −0.4 | 0.28 |
| Education | | | | | | |
| High School degree or less | 0.17 | 0.2 | 0.28 | 0.25 | 0.15 | 0.19 |
| Some college (no degree) | 0.16 | 0.18 | 0.23 | 0.22 | 0.12 | 0.17 |
| Associate degree | −0.003 | 0.2 | 0.25 | 0.25 | 0.02 | 0.19 |
| Bachelor’s degree | 0.01 | 0.09 | −0.12 | 0.11 | −0.06 | 0.09 |
| Income (Ref. …) | | | | | | |
| …50,000 | −0.1 | 0.21 | −0.29 | 0.27 | −0.05 | 0.21 |
| …99,999 | 0.13 | 0.2 | −0.2 | 0.25 | 0.15 | 0.2 |
| …200,999 | 0.05 | 0.2 | −0.23 | 0.25 | 0.07 | 0.2 |
| AI Attitudes | 0.22ᶜ | 0.05 | 0.25ᶜ | 0.07 | 0.23ᶜ | 0.05 |
| Kidney Disease (Ref. Has kidney disease but not on dialysis) | | | | | | |
| Currently on dialysis | 0.18 | 0.10 | 0.32* | 0.13 | 0.15 | 0.1 |
| Has received kidney transplant | 0.25ᵇ | 0.1 | 0.38ᶜ | 0.12 | 0.18 | 0.1 |

| | Information Quality | Empathy | Learning Outcomes |
|---|---|---|---|
| Random effects | | | |
| VAR (Intercept Participants) | 0.25 | 0.41 | 0.24 |
| VAR (Intercept Prompts) | 0.03 | 0.001 | 0.01 |
| Residual | 0.49 | 0.67 | 0.45 |
| Model fit indices | | | |
| Marginal R² | 0.17 | 0.09 | 0.17 |
| Conditional R² | 0.47 | 0.44 | 0.46 |
| AIC | 4,117.09 | 4,656.59 | 4,073 |
| BIC | 4,242.55 | 4,782.05 | 4,199.09 |
| Number of observations | 1,728 | | |

Notes: Cell entries are mixed-effects model coefficients, when controlling for participants’ demographics (age, gender, race/ethnicity, education, and household income), attitudes toward AI, and types of kidney diseases. The number of observations comprises responses from 216 participants who evaluated transcripts across 4 distinct prompts from 2 sources (NKF vs ChatGPT; 1,728 = 216 × 4 × 2).

Abbreviations: AIC, Akaike information criterion; BIC, Bayesian information criterion; B, estimates; Conditional R² (variance explained by fixed and random effects); Marginal R² (variance explained by fixed effects); Ref, reference category; SE, standard error; VAR, variance.

ᵃ P < 0.01. ᵇ P < 0.05. ᶜ P < 0.001.
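One way to write the model summarized in Table 3 is shown below, assuming crossed random intercepts for participants and prompts, as the variance components in the table suggest. The notation is introduced here for illustration and does not appear in the paper: $y_{ijs}$ is the rating given by participant $i$ to the response from source $s$ for prompt $j$, $\mathrm{ChatGPT}_s$ equals 1 for ChatGPT responses and 0 for NKF responses, and $\mathbf{x}_i$ collects the participant-level control variables.

```latex
y_{ijs} = \beta_0 + \beta_1\,\mathrm{ChatGPT}_s + \boldsymbol{\gamma}^{\top}\mathbf{x}_i + u_i + v_j + \varepsilon_{ijs},
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2),\quad
v_j \sim \mathcal{N}(0, \sigma_v^2),\quad
\varepsilon_{ijs} \sim \mathcal{N}(0, \sigma^2)
```

In this notation, $\beta_1$ corresponds to the "Source: ChatGPT (vs NKF)" row, while $\sigma_u^2$, $\sigma_v^2$, and $\sigma^2$ correspond to the participant, prompt, and residual variances.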
Table 3 further shows that ChatGPT’s responses received significantly higher empathy ratings compared to NKF (b = 0.31, SE = 0.04; P < 0.001). The mean empathy score for ChatGPT’s responses was 3.77 (SD = 1.08, 95% CI [3.7-3.85]), compared with 3.46 for NKF responses (SD = 1.03, 95% CI [3.39-3.53]), an 8.2% lower score on average compared with ChatGPT.
In terms of learning outcomes, participants once again rated ChatGPT’s responses more favorably than those from NKF (see Table 3). ChatGPT’s responses were perceived as significantly more educational than NKF’s responses (b = 0.6, SE = 0.03; P < 0.001). On average, participants gave ChatGPT responses a learning score of 4.29 (SD = 0.75, 95% CI [4.24-4.34]). In contrast, NKF responses received an average rating of 3.7 (SD = 0.95, 95% CI [3.63-3.76]), a 13.8% lower score in perceived educational value.
Discussion
Overall, our findings demonstrate a consistent and statistically significant patient preference for ChatGPT-generated responses over content from NKF. A substantial majority of participants (81.3%) favored ChatGPT’s responses. Participants rated ChatGPT’s responses higher in information quality and empathy and reported greater self-perceived learning gains compared with NKF materials. Exploratory analyses of the responses’ readability and length showed that ChatGPT responses were longer and written at a higher reading grade level than NKF materials, although these differences were not statistically significant. These results suggest that generative AI may present information in a way that better meets patient needs and expectations, potentially filling gaps left by traditional educational resources.
The findings presented should be considered within the context of several limitations. Our study relied on web-based panel recruitment, which may not fully represent the diverse populations served by various transplant centers. The selected cohort in this study had, on average, higher levels of income and education, which may be associated with greater literacy and a preference for more detailed information that ChatGPT provided. ChatGPT responses tended to be longer and more consistent in length (i.e., lower standard deviation) than NKF materials. Although this difference was not statistically significant, likely due to the small number of observations (4 pairs per source), longer responses may have offered the deeper explanations participants preferred and conveyed a greater sense of empathy. ChatGPT outputs were also written at a higher grade level, but this difference was similarly not statistically significant. NKF materials appear to be designed for accessibility, as reflected by their Flesch–Kincaid grade-level score (M = 9.85, corresponding to a 9th-10th grade reading level). Their shorter responses may also be better suited for patients with CKD, who often have lower health literacy. Although we attempted to account for demographic imbalances by including education, income, and other variables as controls, representativeness remains limited. Nevertheless, including participants from 38 US states provides valuable insights into perceptions of ChatGPT in the kidney transplant process, and future research should validate these findings in more diverse populations.
ChatGPT’s responses were also evaluated using only four prompts from NKF. This restriction was intended to ensure consistency and focus, allowing participants to provide more reliable and uniform assessments. However, the limited number of prompts may not adequately reflect the broad spectrum of scenarios encountered in kidney transplantation. Additionally, we were unable to examine the underlying factors driving these preferences, such as language features (eg, use of emojis), emotional tone, or interface navigability. Future research should clarify these distinctions.
We selected NKF as a reference point because it represents the type of educational material patients commonly access online, providing a clinically meaningful basis for comparing AI-generated content with resources patients are already likely to use. Although NKF materials are developed and reviewed by clinical experts, they are written for broad patient education rather than as targeted expert-level responses. As such, they do not replace a true human-expert comparator for the specific prompts used in this study.
Most research on generative AI in health care has primarily focused on clinical expert evaluations of its accuracy and utility.8,19,20,29-34 Although expert validation of generative AI in health care is essential, understanding patients’ perspectives is equally important, as they have a deeply personal stake in their care and often have informational needs that differ significantly from those of clinicians. One study examining patient perceptions of generative AI responses to common health questions found that ChatGPT was perceived as more empathetic and useful than physicians responding on web-based forums.35 However, patients’ perceptions of generative AI use remain underexplored in the field of kidney transplantation. Another study suggests that the perceptions of ChatGPT’s information quality in kidney transplantation vary according to general users’ racial/ethnic and educational backgrounds, highlighting potential disparities and important considerations for implementing generative AI tools in health care education.36 This highlights the urgent need to understand how patients with CKD perceive AI-generated information related to kidney transplantation.
Each new version of ChatGPT has demonstrated better performance in answering kidney transplant-related questions, suggesting continued improvement is likely.37 Although numerous generative AI tools are available, this study utilized ChatGPT because of its widespread popularity.38 Although newer models may offer greater accuracy and enhanced safety, survey responses were not re-collected using these versions. Exploratory analyses comparing GPT-4o and GPT-5.1 (accessed on November 15, 2025) responses in terms of response length and Flesch–Kincaid grade level suggested that the outputs of the two GPT versions are generally comparable (see Table S2 in the Supporting Information). These findings support the utility of our study as a baseline and provide a foundation for future research evaluating newer versions (eg, GPT-5.2) of generative AI in kidney transplantation. Future research should also investigate the use of alternative models (eg, Gemini, Meta AI, Claude, and Grok), and generative AI specifically designed for health care or transplantation. Different generative AI models may yield varied results due to differences in their training methods. Examining how health information is perceived when delivered by accessible, general-purpose generative AI models versus specialized platforms could offer valuable insights into the effectiveness and suitability of these tools for communicating medical knowledge.
Although not fully explored in this study, generative AI offers the advantage of a conversational format that encourages follow-up questions and fosters interactive engagement, which static webpages are less equipped to provide. In addition, because generative AI is trained on vast and diverse datasets extending far beyond a single website, its outputs may be broader and less constrained by the perspective of any single source. Educational website organizers may therefore consider incorporating generative AI tools to utilize these benefits. Furthermore, generative AI holds great potential for non-English speakers, as it can function as a real-time translator, enabling users to access information across language boundaries.39,40 Future research should explore full back-and-forth interactions between users and generative AI to gain deeper insight into patients’ needs and to better define the role of generative AI in health care education and communication.
Generative AI remains in its early stages of integrating into people’s daily lives, and its adoption in health care must proceed with caution.41 ChatGPT may outperform traditional web searches in aiding preliminary diagnosis but carries a risk of misinformation and confusion.42 It is important to note that generative AI can occasionally produce responses that are grammatically correct and seemingly plausible but factually inaccurate (hallucinations), which could misinform users if left unchecked.43 Although our expert review found no major inaccuracies in ChatGPT’s responses in this study, the limited scope of prompts prevents a comprehensive assessment of its accuracy across the full range of kidney transplant-related queries. Past studies have suggested that using ChatGPT to answer clinical questions can sometimes lead to inaccurate responses with the potential for harm, underscoring the need for human oversight.20,33 Therefore, we recommend using generative AI as a complementary tool to traditional educational resources and human expertise, rather than as a standalone solution.
Our study represents an initial step toward understanding how generative AI may enhance patient education and engagement in the context of kidney transplantation. Although transplant centers and their multidisciplinary teams should remain the primary and most trusted source of information, generative AI can serve as a valuable supplement. By combining the accessibility and scalability of generative AI with expert human oversight, transplant teams may be able to deliver more engaging, empathetic, and effective informational support to patients navigating complex treatment decisions.
References
- 1. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930-1940. doi:10.1038/s41591-023-02448-8. PMID: 37460753.
- 2. Javaid M, Haleem A, Singh RP. ChatGPT for healthcare services: An emerging stage for an innovative perspective. BenchCouncil Trans Benchmarks Stand Eval. 2023;3(1):100105. doi:10.1016/j.tbench.2023.100105.
- 3. Haupt CE, Marks M. AI-Generated Medical Advice—GPT and Beyond. JAMA. 2023;329(16):1349. doi:10.1001/jama.2023.5321. PMID: 36972070.
- 4. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. Drazen JM, Kohane IS, Leong TY, eds. N Engl J Med. 2023;388(13):1233-1239. doi:10.1056/NEJMsr2214184. PMID: 36988602.
- 5. Harris E. Large language models answer medical questions accurately, but can’t match clinicians’ knowledge. JAMA. 2023;330(9):792. doi:10.1001/jama.2023.14311. PMID: 37548971.
- 6. Zaretsky J, Kim JM, Baskharoun S. Generative artificial intelligence to transform inpatient discharge summaries to patient-friendly language and format. JAMA Netw Open. 2024;7(3):e240357. doi:10.1001/jamanetworkopen.2024.0357. PMCID: PMC10928500. PMID: 38466307.
- 7. Gordon EB, Towbin AJ, Wingrove P. Enhancing patient communication with chat-GPT in radiology: evaluating the efficacy and readability of answers to common imaging-related questions. J Am Coll Radiol. 2024;21(2):353-359. doi:10.1016/j.jacr.2023.09.011. PMID: 37863153.
- 8. Ayers JW, Poliak A, Dredze M. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. 2023;183(6):589. doi:10.1001/jamainternmed.2023.1838. PMID: 37115527. PMCID: PMC10148230.
