Large Language Models as Clinical Nutrition Decision Tools: Quantitative Bias and Guideline Deviation in Type 2 Diabetes Meal Planning

Pinar Ece Karakas; Aysenur Calik; Ayse Betul Bilen; Kardelen Kandemir; Muveddet Emel Alphan

PMC · DOI:10.3390/healthcare14060739·March 13, 2026

Large Language Models as Clinical Nutrition Decision Tools: Quantitative Bias and Guideline Deviation in Type 2 Diabetes Meal Planning

Pinar Ece Karakas, Aysenur Calik, Ayse Betul Bilen, Kardelen Kandemir, Muveddet Emel Alphan

PDF

Open Access

TL;DR

AI-generated meal plans for type 2 diabetes differ from expert diets, showing bias in energy and fiber, and need expert evaluation before use.

Contribution

This study quantitatively evaluates the bias and guideline adherence of large language models in generating dietary plans for type 2 diabetes.

Findings

01

LLM-generated diets showed systematic deviations from dietitian-designed reference diets in energy and macronutrient content.

02

AI meal plans lacked sufficient fiber and showed limited individualized medical nutrition therapy.

03

Only one of the tested models showed relatively consistent guideline adherence.

Abstract

What are the main findings? Large language models generated dietary plans for type 2 diabetes that differed substantially from a guideline-based, dietitian-designed reference diet, particularly in energy intake and dietary fiber adequacy.Most AI-generated diets followed a low-energy, lower-carbohydrate, higher-protein, and insufficient-fiber pattern, with limited evidence of individualized medical nutrition therapy. Large language models generated dietary plans for type 2 diabetes that differed substantially from a guideline-based, dietitian-designed reference diet, particularly in energy intake and dietary fiber adequacy. Most AI-generated diets followed a low-energy, lower-carbohydrate, higher-protein, and insufficient-fiber pattern, with limited evidence of individualized medical nutrition therapy. What are the implications of the main findings? AI-generated dietary plans for…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Chemicals1

carbohydrate

Diseases3

type 2 diabetes T2DM diabetes

Figures3

Click any figure to enlarge with its caption.

Keywords

artificial intelligencelarge language modelstype 2 diabetesmedical nutrition therapydietary planningguideline adherence

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsDietetics, Nutrition, and Education · Diet and metabolism studies · Nutrition, Genetics, and Disease

Full text

1. Introduction

Diabetes is a chronic metabolic disease affecting more than half a billion people worldwide and is associated with a rapidly increasing prevalence, as well as substantial morbidity, mortality, and healthcare expenditures. The global rise in physical inactivity and sedentary lifestyles, the widespread adoption of energy-dense and nutrient-poor dietary patterns commonly referred to as the Western diet, and the growing aging population are among the primary drivers of the increasing incidence of type 2 diabetes mellitus (T2DM) [1]. Medical nutrition therapy (MNT) is a cornerstone of T2DM management and plays a critical role in achieving glycemic control, reducing cardiometabolic risk factors, and improving quality of life [2]. International diabetes guidelines emphasize that MNT for T2DM should focus on regulating both the amount and quality of carbohydrate intake, increasing dietary fiber consumption, ensuring an appropriate distribution of healthy fatty acids, and maintaining energy balance, while also highlighting the importance of individualized and sustainable nutrition therapy [3]. However, factors such as time constraints, limited access to dietitians, patient preferences, and educational level may hinder the translation of evidence-based recommendations into individualized, sustainable, and feasible dietary plans in clinical practice, thereby reducing the effectiveness of MNT [4]. Against this background, the growing need for accessible, timely, and individualized nutrition support has prompted increasing interest in scalable digital tools, including large language models (LLMs), as potential aids for meal planning and dietary guidance [5,6].

In recent years, LLMs have gained increasing attention in healthcare, including nutrition, because of their ability to generate rapid, detailed, and user-responsive recommendations. In nutrition practice, these tools have attracted considerable interest from both healthcare professionals and the public, particularly because of their potential to generate meal plans, summarize dietary guidance, and provide apparently personalized recommendations based on user prompts [5,6]. However, important concerns remain regarding the reliability of artificial intelligence (AI)-generated dietary advice. Across health-related applications, LLMs have been reported to produce quantitatively inaccurate outputs, hallucinated content, inconsistent recommendations, and responses that may drift from evidence-based clinical guidelines. In chronic conditions such as T2DM, where dietary management requires sustained accuracy, balance, and individualization, such limitations may reduce clinical usefulness and potentially introduce patient safety risks [7,8,9].

In the literature, studies that systematically evaluate dietary plans generated by different LLMs in accordance with the principles of MNT for T2DM are limited. In particular, relatively few studies have examined whether such plans are quantitatively aligned with evidence-based nutritional targets and clinically applicable when compared with an expert dietitian-designed reference diet. Therefore, the aim of this study was to evaluate dietary plans generated by different LLMs within the framework of MNT for T2DM in terms of nutrient adequacy, compliance with current dietary guidelines, and clinical applicability, and to compare the obtained outputs with a reference diet plan prepared by an expert dietitian based on up-to-date, evidence-based guidelines. By combining nutrient-level assessment with expert evaluation, this study aims to contribute to ongoing discussions on the safe, evidence-based, and clinically responsible use of AI-based tools in nutrition and diabetes management.

2. Materials and Methods

2.1. Study Design

This study was designed as a quantitative agreement and bias analysis to evaluate the clinical safety and guideline concordance of LLMs when used as nutrition decision support tools in T2DM. Three-day dietary plans generated by widely used LLMs were compared with a dietitian-designed, guideline-concordant reference diet. Nutrient composition was quantitatively analyzed, and agreement was assessed to determine clinical interchangeability. In addition, the adherence of AI-generated dietary recommendations to national and international diabetes nutrition guidelines was independently evaluated by registered dietitians using a structured scoring system. Details of the study design and analytical workflow are presented in Figure 1 and the following subsections.

2.2. Generation of AI-Based Dietary Plans

For this study, six versions of three widely used LLMs were selected: ChatGPT-5 Instant [10], ChatGPT-5 Auto [11], ChatGPT-5 Thinking [12], Claude Sonnet 4.5 [13], Gemini Flash 2.5 [14], and Gemini Pro 2.5 [15]. The selected models were chosen to represent different levels of accessibility (free and paid), distinct inference modes (rapid response, automatic, and advanced reasoning), and multiple developer platforms. This approach was intended to enable a more comprehensive and representative evaluation of LLM outputs under real-world usage conditions. To further minimize the influence of prior user interactions, a new and dedicated account was created for each platform, ensuring that model responses were not influenced by previous conversation history or personalized learning. Model-specific parameters such as temperature settings or system-level instructions were not manually adjusted, and all models were used under their default interface configurations as available to general users at the time of data collection.

All models were instructed using the following prompt (translated from Turkish): “Prepare a 3-day 1800 kcal diet plan for individuals with type 2 diabetes based on national and international nutrition guidelines. Also include the macronutrient and micronutrient values, their proportions, and energy contributions. In addition, provide detailed nutrition recommendations for type 2 diabetes.”

Each model generated a three-day dietary plan. Each day was treated as a separate observation, resulting in a total of 18 observations (6 models × 3 days) (Text S2: Full Version of Reference Diet and AI-Generated Diets).

A single standardized clinical scenario (fixed energy target and disease profile) was intentionally used to isolate intrinsic model performance and eliminate variability arising from patient heterogeneity. This approach allowed for a controlled assessment of systematic bias and agreement across models under identical clinical conditions.

All model interactions were conducted as single-turn queries in newly initiated independent sessions. No follow-up prompts, clarification requests, retries, or iterative refinements were allowed, and the first complete output from each model was retained for analysis. Incomplete or non-interpretable responses would have been excluded and documented as protocol deviations; no such cases were encountered in this study. No user-defined system prompts were used. Model parameters such as temperature or other generation settings were not visible or user-configurable for all platforms and therefore could not be standardized. Before BeBiS [16] analysis, AI-generated diet plans were converted into a standardized food-record format. Food items and portion descriptions were entered as closely as possible to the model-generated outputs using foods available in the BeBiS [16] database, and no food substitutions were required.

2.3. Reference Diet Preparation

The reference diet was designed by a registered dietitian (RD) with clinical experience in diabetes nutrition therapy, following the same specifications provided in the standardized AI prompt (three days, 1800 kcal) (Text S2: Full Version of Reference Diet and AI-Generated Diets). The diet was developed in accordance with current national and international evidence-based diabetes nutrition guidelines, including those issued by the International Diabetes Federation (IDF) [1], the Society of Endocrinology and Metabolism of Turkey (SEMT) [17], the American Diabetes Association (ADA) [3], and the ADA–European Association for the Study of Diabetes (EASD) consensus reports [2]. Although current guidelines do not specify a standard amount for energy and macronutrient intake and emphasize the need for individualization, they recommend that macronutrient distribution should provide 45–60% of total energy from carbohydrates, 10–20% from protein, and 20–35% from fat. In addition, recent guidelines recommend a minimum dietary fiber intake of 14 g per 1000 kcal of energy [3,17]. The reference diet was designed in accordance with current dietary guidelines and structured to include three main meals and three snacks. The diet was characterized by a high dietary fiber content and a predominant inclusion of vegetables, fruits, whole grains, and legumes, while the intake of refined carbohydrates, trans fats, and saturated fats was deliberately limited. The reference diet served as a guideline-concordant benchmark for comparison with the AI-generated dietary plans.

2.4. Nutrient Analysis of Dietary Plans

The energy and nutrient contents of the three-day dietary plans generated by the different LLMs were analyzed by the researchers using BeBiS 9.0 software [16]. The analyzed dietary components included total energy, carbohydrate, protein, fat, and dietary fiber. All food items proposed by the AI-generated diets were available in the BeBiS 9.0 [16] database; therefore, no food substitutions were required during the analysis.

Each day of the AI-generated dietary plans was treated as a separate dietary record, resulting in a total of 18 AI-generated diets. These individual daily diets were compared with the reference diet in the agreement analyses. In addition, mean values for energy and nutrient contents were calculated across all 18 AI-generated diets for comparative analyses.

Although the standardized prompt explicitly requested micronutrient values, a systematic qualitative review of model outputs revealed that the majority of LLMs either omitted micronutrient data entirely or reported only partial values for selected nutrients (e.g., sodium, calcium). Due to the inconsistent and incomplete nature of micronutrient reporting across models, a quantitative comparative analysis of micronutrient adequacy was not feasible. This limitation in model output itself is reported as a qualitative finding relevant to clinical safety.

2.5. Evaluation of Guideline Adherence and Dietary Recommendations

The content and accuracy of the dietary plans and accompanying dietary recommendations generated by the different large language models were evaluated based on national and international diabetes nutrition guidelines. These guidelines included those published by IDF, SEMT, ADA, and ADA-EASD consensus reports (Table 1).

Guideline adherence was assessed by examining key dietary domains derived from the recommendations, including energy appropriateness, macronutrient distribution and dietary fiber adequacy. The qualitative content of dietary recommendations was independently evaluated by registered dietitians using a structured scoring system, as described in the Statistical Analysis section.

2.6. Statistical Analysis

Statistical analyses were performed using IBM SPSS Statistics version 24.0 [18]. Descriptive statistics were calculated for energy and nutrient values derived from the BeBiS 9.0 [16] analyses of the 18 AI-generated dietary records and are presented as mean ± standard deviation.

As each AI-generated daily diet was directly compared with the dietitian-designed reference diet, paired-sample t-tests were used to assess differences in mean energy, carbohydrate, protein, fat, and dietary fiber contents. Prior to analysis, the normality of the differences between paired measurements was evaluated. Deviations of AI-generated diets from the reference diet were visually presented using box-and-whisker plots.

Agreement between the reference diet and AI-generated diets for energy, carbohydrate, protein, fat, and dietary fiber was assessed using Bland–Altman analysis, following the method described by Bland and Altman [19]. For each dietary component, the mean difference (bias) and the 95% limits of agreement (LOA) were calculated, and Bland–Altman plots were generated using the differences between each AI-generated daily diet and the reference diet against their means.

The appropriateness of dietary recommendations generated by the different language models in relation to national and international diabetes guidelines was independently evaluated by two registered dietitians with clinical experience in diabetes nutrition therapy. Recommendations were scored on a 10-point scale based on predefined criteria derived from the literature. Inter-rater reliability was assessed using the Intraclass Correlation Coefficient (ICC). A p-value < 0.05 was considered statistically significant for all analyses.

To assess potential within-model clustering of daily dietary outputs, sensitivity analyses were performed using linear mixed-effects models with model identity included as a random intercept. In addition, intraclass correlation coefficients (ICCs) were calculated to evaluate similarity among the three daily diets generated by each language model. These analyses were conducted to examine whether daily dietary plans could reasonably be treated as independent observations in the primary analyses. Detailed results are provided in the Supplementary Materials.

To facilitate clinical interpretation of the Bland–Altman results, clinically meaningful thresholds were considered based on established nutritional guidelines. A deviation of ≥10% from the target daily energy intake (equivalent to approximately ±180 kcal for an 1800 kcal diet) was considered clinically relevant. For dietary fiber adequacy was evaluated relative to guideline recommendations of approximately 14 g per 1000 kcal, corresponding to roughly 25–30 g/day for an 1800 kcal diet.

3. Results

3.1. Energy and Nutrient Content of AI-Generated Diets

The energy and nutrient composition of dietary plans generated by different LLMs under identical standardized conditions were quantitatively compared with a guideline-concordant reference diet designed by a registered dietitian. According to the BeBiS 9.0 analysis, the reference diet provided 48.0% of total energy from carbohydrates, 17.9% from protein, and 33.9% from fat, consistent with current diabetes nutrition recommendations.

Across models, systematic deviations from the target energy prescription were observed. Diets generated by Gemini 2.5 Pro, ChatGPT-5 Auto, and Gemini 2.5 Flash provided significantly lower total energy than the reference diet (p < 0.05), indicating underestimation of caloric targets under standardized conditions.

Carbohydrate content was consistently lower in most AI-generated diets compared with the reference diet, with statistically significant differences across all models except Claude Sonnet 4.5 (p < 0.05). This pattern reflects a systematic shift toward lower carbohydrate distribution relative to guideline recommendations.

Protein content demonstrated inter-model variability, with ChatGPT-5 Instant and ChatGPT-5 Thinking generating significantly higher protein amounts than the reference diet (p < 0.05). No statistically significant differences were detected for total fat content (p > 0.05).

Regarding dietary fiber, most LLM-generated diets contained lower fiber levels compared with the reference diet; however, Claude Sonnet 4.5 showed no statistically significant difference (p > 0.05). Overall, these findings indicate consistent quantitative discrepancies in key macronutrient targets across models (Table 2 and Table S1).

3.2. Deviations from the Reference Diet: Boxplot Analysis

Deviations in energy and nutrient contents of the three-day dietary plans generated by different LLM relative to the reference diet are presented using box-and-whisker plots in Figure 2. In terms of energy content, the diets generated by Claude Sonnet 4.5 exhibited the widest range of deviation from the reference diet, with values ranging approximately from −100 to +300 kcal. In contrast, Gemini 2.5 Flash showed a pronounced negative deviation in energy content, with values around −500 kcal. For the remaining models, energy deviations were predominantly distributed between −200 and −400 kcal.

All LLMs demonstrated negative deviations in carbohydrate content compared with the reference diet. The greatest reductions in carbohydrate content were observed in Gemini 2.5 Pro and Gemini 2.5 Flash, with deviations ranging approximately from −90 to −120 g. Protein content showed smaller deviations overall, with diets generated by ChatGPT-5 Auto providing protein amounts closest to the reference diet, typically within a range of −5 to +5 g.

Regarding fat content, Gemini 2.5 Flash exhibited the largest negative deviations, with fat intake approximately 15–25 g lower than the reference diet. In contrast, Claude Sonnet 4.5 showed a consistent positive deviation in fat content, with values approximately 10–20 g higher than the reference diet, while the remaining models displayed fat contents broadly comparable to the reference diet.

With respect to dietary fiber, all LLMs except Claude Sonnet 4.5 generated diets with lower fiber content than the reference diet. Fiber deviations for most models were consistently negative and generally ranged from approximately −10 to −20 g, whereas diets generated by Claude Sonnet 4.5 were centered around the reference value and, in several cases, exhibited modest positive deviations. These distribution patterns are consistent with the mean differences reported in Table 2.

3.3. Agreement Between AI-Generated Diets and the Reference Diet

Agreement between the three-day dietary plans generated by the LLMs and the reference diet was assessed using Bland–Altman analysis, with the corresponding plots presented in Figure 3. For energy intake, the mean difference relative to the reference diet was 207.7 kcal, with 95% LOA ranging from −233.6 to 649.0 kcal, indicating a systematic tendency of AI-generated diets to provide lower total energy compared with the reference diet. Considering a clinically relevant threshold of ±10% deviation from the target energy intake (180 kcal for an 1800 kcal diet), several AI-generated plans exceeded this range, indicating potential clinical implications for glycemic management and dietary planning. Across energy and macronutrient components, the proportion of observations falling outside the 95% limits of agreement ranged from 5.6% to 11.1%. The proportions of observations outside the 95% LOA for energy and macronutrients are presented in Supplementary Table S2.

The mean differences and limits of agreement for individual macronutrients were as follows:

Carbohydrate, +77.1 g (95% LOA: 9.6 to 144.6 g); protein, −12.7 g (95% LOA: −47.6 to 22.0 g); fat, −2.7 g (95% LOA: −31.9 to 26.5 g); and dietary fiber, +9.6 g (95% LOA: −7.2 to 26.4 g).

Overall, Bland–Altman plots demonstrated the presence of observations outside the limits of agreement for all evaluated nutrients, reflecting systematic discrepancies between AI-generated dietary plans and the dietitian-designed reference diet, particularly for energy and carbohydrate content.

3.4. Dietitian-Based Evaluation of Guideline Adherence

Dietitian-based evaluations of the three-day dietary plans generated by the LLMs are presented in Table 3. Inter-rater reliability between the two dietitians was assessed using the ICC, which indicated a high level of agreement (ICC = 0.806; 95% confidence interval: 0.654–0.896; p < 0.001).

Across evaluation domains, diets generated by Claude Sonnet 4.5 consistently received higher scores for guideline adherence, appropriateness of meal distribution, and accuracy of reported energy and macronutrient values. In contrast, Gemini 2.5 Flash and Gemini 2.5 Pro received lower scores across most evaluation criteria, particularly in guideline adherence, meal distribution, and consistency between stated and calculated nutrient values. ChatGPT-5 variants demonstrated intermediate performance, with relatively higher scores for clarity of dietary recommendations but greater variability across evaluative domains.

Sensitivity analyses using linear mixed-effects models showed negligible random intercept variance for most nutrients, indicating minimal clustering of daily outputs within individual models. Intraclass correlation coefficients ranged from −0.073 to 0.642, suggesting low to moderate similarity across daily diets generated by the same model (Supplementary Tables S4 and S5).

3.5. Micronutrient Reporting

Although the standardized prompt explicitly requested micronutrient values, a systematic review of model outputs revealed that most LLMs reported only target or average values for selected micronutrients (most commonly sodium and calcium) rather than day-specific calculated amounts. Day-level micronutrient data, which would be required for a valid quantitative comparison analogous to the macronutrient analyses, were not provided by any model. Consequently, a systematic micronutrient adequacy analysis was not feasible, and this finding is itself reported as a qualitative limitation of current LLM outputs in clinical nutrition practice.

4. Discussion

Advances in artificial intelligence technologies have led to a rapid expansion of AI applications across various fields, including nutrition and dietetics. AI-based tools for providing nutrition advice and generating dietary plans have become an increasingly discussed topic in both academic and clinical contexts with the widespread adoption of LLMs [20]. Diabetes is one of the most frequently discussed disease conditions in interactions with LLMs. LLMs are used in diabetes management for purposes such as meal planning, providing personalized recipe recommendations, and estimating insulin doses [21]. Nevertheless, individualized MNT, planned and monitored by a RD, remains an integral component of glycemic control and the management of T2DM [22]. The increasing use of LLMs in diabetes management highlights the need to validate the reliability and clinical appropriateness of the recommendations provided by these models.

In national and international guidelines, the primary emphasis in the medical nutrition therapy of T2DM is placed not on macronutrient distribution percentages, but rather on food quality, degree of processing, dietary diversity, and sustainability [3]. Moreover, systematic reviews and meta-analyses published in recent years indicate that high adherence to healthy dietary patterns is associated with significant improvements in HbA_1c_, fasting plasma glucose, and cardiovascular risk factors [23,24]. In this study, diets for T2DM generated by commonly used LLMs were compared with a reference diet prepared by a dietitian working in a diabetes outpatient clinic in accordance with current diabetes guidelines, in terms of energy and macronutrient content, and the adherence of AI-generated diets to these guidelines was evaluated.

The results suggest that LLMs can generate structured dietary plans for diabetes when provided with a standardized prompt; however, notable limitations remain regarding quantitative accuracy and clinical relevance. In this study, the diets produced by LLMs did not meet the specified energy targets and generally provided lower energy compared with the reference diet prepared by a dietitian. Diets generated by ChatGPT-5 Auto, Gemini 2.5 Flash, and Gemini 2.5 Pro contained less energy than requested and showed poor alignment with the reference diet. By contrast, the diet generated by the Claude Sonnet 4.5 model was closer to both the prescribed energy level and the reference diet. Consistent with these findings, previous research evaluating AI-generated diets designed for weight management across different sexes and energy requirements reported that, while overall diet quality was acceptable, macronutrient distribution was often inappropriate. In that study, consistent with the findings of this study, AI-generated diets, including those produced by Gemini 2.5. and ChatGPT-4, exhibited deviations in energy content ranging from 5% to 20% [25]. In this study, differences in macronutrient distribution were also observed among the models. The diet generated by Claude Sonnet 4.5 showed the closest alignment with the reference diet in terms of both energy and dietary fiber content. These findings suggest that factors such as LLM architecture, the scope of training data, and the way instructions are interpreted may influence diet quality, leading to clinically meaningful performance differences among language models in the context of clinical nutrition.

In the diets generated by LLMs, the percentage of energy derived from carbohydrates does not align with the range of 45–60% recommended by diabetes nutrition guidelines [1,3,17,24] (Supplementary Table S1). Deviations in carbohydrate content showed a systematic reduction in the Gemini 2.5 Flash and Gemini 2.5 Pro models. Consistent with our findings, the study conducted by Bayram et al. reported that AI-generated diets frequently exhibit carbohydrate contents that are not aligned with current dietary guidelines [7]. This tendency toward lower carbohydrate content in AI-generated diets may be related to the increasing popularity of low-carbohydrate dietary approaches in recent years.

Current guidelines recommend increasing dietary fiber intake primarily through natural sources such as whole grains, legumes, vegetables, fruits, and nuts, while limiting dietary patterns rich in refined carbohydrates and low in fiber [3,17,24]. In the ADA Standards of Care in Diabetes 2026, the importance of high-quality, minimally processed, and fiber-rich carbohydrate sources is further emphasized, regardless of the absolute amount of carbohydrate included in the meal plan [3]. In the present study, all AI models except Claude Sonnet 4.5 generated diets containing lower amounts of dietary fiber compared with the reference diet. Similarly, a study comparing AI-generated healthy diets with dietitian-designed diets reported that the reference diet had a higher carbohydrate content, was more compliant with the prompt, and was richer in dietary fiber, consistent with our findings [26]. Overall, the low dietary fiber content of AI-generated diets may be considered a major limitation to the ability of LLMs to produce reliable and guideline-adherent nutrition plans for individuals with T2DM. From a clinical perspective, interpretation of agreement between AI-generated diets and the reference diet should also consider meaningful nutritional thresholds. For example, deviations of approximately ≥ 10% in daily energy intake may influence glycemic control and weight management in individuals with type 2 diabetes. Similarly, dietary fiber intake recommendations suggest approximately 14 g per 1000 kcal, which corresponds to roughly 25–30 g/day for an 1800 kcal diet. Some AI-generated diet plans showed deviations from these targets, which may have potential implications for long-term metabolic outcomes.

In our study, a more heterogeneous distribution in protein content was observed across models, with several models producing diets higher in protein than the reference diet. Similarly, a study investigating the reliability and accuracy of diets generated by LLMs reported deviations of up to ±65 g in protein content between different models and found that LLMs were unable to consistently achieve appropriate macronutrient distributions [27]. This finding suggests that the increasing use of LLMs may lead to recommendations that are incompatible with diabetes-related complications such as nephropathy and underscores the need for individualized clinical evaluation in the presence of comorbid conditions accompanying diabetes.

In this study, the Claude Sonnet 4.5 model systematically generated diets with higher fat content, whereas the Gemini 2.5 Flash model produced diets with lower fat content. In a study by Bayram et al., which evaluated energy calculations, nutrient distribution, and adherence to nutrition care process standards using 24 different virtual patient profiles, ChatGPT-4 generated diets with energy contents closest to the target; however, all included AI models were found to produce diets with excessively high fat content [7]. In a study evaluating diets generated by ChatGPT-4 for dyslipidemia and hypertension, the model was reported to perform well in providing general health recommendations; however, it produced diets characterized by low carbohydrate (24%) and high fat (54%) content. Additionally, despite successful recommendations, the diets were found to be high in sodium, calcium, and cholesterol. Consistent with our findings, these results confirm that while LLMs may provide disease-appropriate recommendations, they can generate incongruent outputs in actual diet planning. The same study also reported that when a fixed energy target was provided to the language model, it produced diets more closely aligned with the specified energy level compared with prompts without an explicit energy target [28]. However, in our study, LLMs were unable to generate diets consistent with the specified energy target despite the provision of an explicit energy value. Systematic discrepancies were observed between LLM-generated diets and the reference diet for both energy and macronutrients. In particular, the wide limits of agreement observed for energy and carbohydrate intake indicate that AI-generated outputs should be interpreted with caution in individual clinical practice. Although some LLMs were able to produce diets with macronutrient distributions close to the reference diet, inconsistencies in energy and nutrient content, together with the lack of consideration of individualized requirements and comorbidities, demonstrate that AI-based dietary plans do not yet provide a level of accuracy and appropriateness sufficient to replace dietitian-guided nutrition therapy.

In a study comparing patient education materials generated by expert physicians and LLMs for diabetic kidney disease, some LLMs were found to produce materials of similar quality to those prepared by physicians; however, models such as ChatGPT-4 were reported to be insufficient in terms of understandability and accuracy [29]. Similarly, in the study conducted by Bayram et al. using T2DM patient profiles, recommendations related to complications such as hypoglycemia were found to be insufficient [7]. Therefore, consistent with the findings of our study, it is emphasized that health-related recommendations generated by LLMs should always be reviewed and validated by qualified healthcare professionals [7,29]. In a study evaluating the adherence of dietary recommendations for heart health generated by different LLMs to scientific guidelines, the models were found to provide generally appropriate and reliable advice; however, their recommendations regarding quantitative values, such as carbohydrate and sugar intake, were reported to be weak [30]. In our study, although the importance of dietary fiber intake was frequently mentioned textually in the nutrition recommendations generated by LLMs, the majority of AI-generated diets did not adequately and consistently include fiber-rich food groups. This finding indicates that while LLMs are strong in generating explanatory recommendations, they remain limited in translating these suggestions into quantitative and food-based dietary planning. A further notable finding was the failure of most LLMs to provide complete micronutrient data despite explicit prompting. This is clinically relevant in T2DM, as deficiencies in key micronutrients such as magnesium, vitamin D, and zinc have been associated with impaired glycemic control, representing an additional gap in the clinical utility of current LLM outputs [31,32].

Although meal-level macronutrient distributions were not analyzed in detail in this study, the low dietitian evaluation scores observed for some models suggest that AI-generated diets may have limitations not only in terms of total daily macronutrient targets but also regarding meal structure and overall daily planning consistency. This finding indicates that while LLMs may demonstrate a certain level of adequacy in calculating total nutrient values, they remain insufficient in reflecting meal planning, which is a key component of MNT. Current diabetes guidelines emphasize that one of the most important components of MNT is individualization. The individualized approach prioritizes nutrition therapy that takes into account personal characteristics, clinical findings, and lifestyle factors rather than applying uniform dietary recommendations [3,17,24]. From a clinical perspective, these findings suggest that large language models should not currently be used as standalone tools for medical nutrition therapy in T2DM. Instead, they may serve as supportive tools within supervised clinical workflows, where dietitians critically evaluate and adapt AI-generated recommendations. The systematic deviations observed in energy and macronutrient composition highlight the need for guideline-informed model development, domain-specific training datasets, and rigorous clinical validation before such tools can be safely integrated into routine nutrition practice.

5. Strengths and Limitations

This study has several strengths. All LLMs were tested using the same standardized prompt in independent sessions, allowing a controlled comparison across models. In addition, AI-generated meal plans were compared with a reference diet developed by an expert dietitian on the basis of current evidence-based recommendations for T2DM, which provided a clinically meaningful benchmark. Another strength is that the study combined quantitative nutrient analysis with independent expert scoring of guideline adherence, enabling both analytical and clinical interpretation of LLM-generated outputs.

Nevertheless, some limitations should be acknowledged. The analysis was limited to a single standardized 3-day, 1800 kcal T2DM scenario, and therefore the findings may not be directly generalized to other dietary prescriptions or more complex clinical contexts. In weight-loss scenarios with lower energy targets, macronutrient distribution patterns may differ substantially from those observed in this study. More critically, in renal-modified diets required for patients with diabetic nephropathy, protein restriction and electrolyte management represent essential safety parameters; given the inter-model variability in protein content identified in the present study, the clinical risk associated with LLM-generated dietary plans may be considerably greater in such contexts. The number of evaluated LLMs was also limited to those accessible during the study period, and because the performance of such systems changes rapidly over time, the results should be interpreted as a time-specific assessment. Despite explicit instructions to include micronutrient data, most LLMs failed to provide day-specific calculated values, reporting only general targets or averages for selected nutrients, which precluded a systematic quantitative micronutrient analysis. The study therefore primarily focused on macronutrients and dietary fiber as core components of T2DM meal planning. Future research should examine a broader range of nutrients and clinical scenarios, including weight-loss and renal-modified dietary contexts, more diverse patient profiles, and newly released models.

6. Conclusions

This study demonstrates that current large language models generate dietary plans for T2DM that show systematic deviations from guideline-based reference diets. In our study, AI-generated diets largely exhibited similar energy and macronutrient profiles and did not incorporate adaptations based on individual characteristics. LLMs appeared to treat individuals with T2DM largely as a homogeneous group. In contrast, clinical factors such as sex, age, body weight, and comorbid conditions can substantially influence individual energy and protein requirements. Despite this, when provided with a standardized prompt, LLMs did not request additional information to support individualization and failed to deliver personalized dietary recommendations. This contradicts the principle of individualized nutrition therapy and underscores that dietitian guidance remains essential in the management of T2DM. Nevertheless, in an evolving digital landscape, the presence of LLMs is undeniable, and they may be utilized as supportive tools by dietitians within MNT.

Bibliography32

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Magliano D.J. Boyko E.J. IDF Diabetes Atlas 10th Edition Scientific Committee IDF Diabetes Atlas 10th ed.International Diabetes Federation Brussels, Belgium 2021
2Davies M.J. Aroda V.R. Collins B.S. Gabbay R.A. Green J. Maruthur N.M. Rosas S.E. Del Prato S. Mathieu C. Mingrone G. Management of Hyperglycaemia in Type 2 Diabetes, 2022. A Consensus Report by the American Diabetes Association (ADA) and the European Association for the Study of Diabetes (EASD)Diabetologia 2022651925196610.1007/s 00125-022-05787-236151309 PMC 9510507 · doi ↗ · pubmed ↗
3American Diabetes Association Professional Practice Committee for Diabetes* 5. Facilitating Positive Health Behaviors and Well-Being to Improve Health Outcomes: Standards of Care in Diabetes-2026 Diabetes Care 202649 S 89S 13110.2337/dc 26-S 00541358898 PMC 12690188 · doi ↗ · pubmed ↗
4Ali H.I. Al Harbi A.I. Alnahdi M. Alshamsi M.S. Aldhaheri M.R. Al Meqbaali S.M. Aldahmani M.A. Ilesanmi-Oyelere B. Al Dhaheri A.S. Ismail L.C. Self-Efficacy, Self-Regulation, and Barriers to Dietitian-Led Nutrition Counseling: Qualitative Interviews with Adults with Diabetes in the United Arab Emirates J. Prim. Care Community Health 2025162150131925138830610.1177/2150131925138830641267461 PMC 12639237 · doi ↗ · pubmed ↗
5Belkhouribchia J. Pen J.J. Large language models in clinical nutrition: An overview of its applications, capabilities, limitations, and potential future prospects Front. Nutr.202512163568210.3389/fnut.2025.163568240851903 PMC 12367769 · doi ↗ · pubmed ↗
6Panayotova G.G. Artificial intelligence in nutrition and dietetics: A comprehensive review of current research Healthcare 202513257910.3390/healthcare 1320257941154258 PMC 12563881 · doi ↗ · pubmed ↗
7Bayram H.M. Arslan S. Ozturkcan A. Evaluating AI-Generated Meal Plans for Simulated Diabetes Profiles: A Guideline-Based Comparison of Three Language Models J. Eval. Clin. Pract.202531 e 7029510.1111/jep.7029541055593 · doi ↗ · pubmed ↗
8Fast D. Adams L.C. Busch F. Fallon C. Huppertz M. Siepmann R. Prucker P. Bayerl N. Truhn D. Makowski M. Autonomous medical evaluation for guideline adherence of large language modelsnpj Digit. Med.2024735810.1038/s 41746-024-01356-639668168 PMC 11638254 · doi ↗ · pubmed ↗