Patient-specific deep offline artificial pancreas for blood glucose regulation in type 1 diabetes

Yixiang Deng; Kevin Arao; Christos S. Mantzoros; George Em Karniadakis

PMC · DOI:10.1016/j.smhl.2026.100633·March 1, 2026


TL;DR

This paper introduces a personalized artificial pancreas system using AI to better regulate blood glucose levels in type 1 diabetes patients, especially during physical activity.

Contribution

A novel framework combining systems biology-informed neural networks and deep reinforcement learning for patient-specific glucose regulation.

Findings

01

The system improved insulin dosing and glucose control compared to existing methods.

02

Patient-specific models accounted for carbohydrate intake and exercise intensity effectively.

03

The framework reduced risks of hypoglycemia during physical activity.

Abstract

Due to insufficient insulin secretion, patients with type 1 diabetes mellitus (T1DM) are prone to blood glucose fluctuations ranging from hypoglycemia to hyperglycemia. While dangerous hypoglycemia may lead to coma immediately, chronic hyperglycemia increases patients’ risks for cardiorenal and vascular diseases in the long run. In principle, an artificial pancreas – a closed-loop insulin delivery system requiring patients to manually input insulin dosage according to the upcoming meals – could supply exogenous insulin to control the glucose levels and hence reduce the risks from hyperglycemia. However, insulin overdosing in some type 1 diabetic patients, who are physically active, can lead to unexpected hypoglycemia beyond the control of the common artificial pancreas. Therefore, it is important to take into account the glucose decrease due to physical exercise when designing the…


Keywords

Artificial pancreas; Type 1 diabetes; Physical exercise; Wearable devices; Offline reinforcement learning; Digital twin



Full text

Introduction

Diabetes mellitus is a growing epidemic and its prevalence has been increasing in nearly all countries (Lovic et al., 2020). Global estimates show that the prevalence of diabetes among adults aged 20–79 years was 8.8% in 2015 and is predicted to rise to 10.8% in 2040 (Ogurtsova et al., 2017). In the US, according to the Centers for Disease Control and Prevention (CDC), a total of 34.2 million people, or 10.5% of the US population, had diabetes in 2018 (Prevention, 2020). The most common forms of diabetes are type 1 diabetes mellitus (T1DM) and type 2 diabetes mellitus (T2DM). In 2016, US data showed that T1DM and T2DM accounted for approximately 6% and 91% of all cases of diagnosed diabetes, respectively (Bullard et al., 2018). T1DM is due to autoimmune beta-cell destruction, which usually leads to absolute insulin deficiency. On the other hand, T2DM is due to progressive loss of adequate beta-cell insulin secretion with associated insulin resistance (Care, 2022; Skyler et al., 2017). Glucose is integral to energy consumption as it serves as a primary metabolic fuel. Under normal physiology, in a fasting state, basal insulin secretion helps match hepatic gluconeogenesis to maintain a blood glucose target between 70 and 130 mg/dl. After a meal, blood glucose levels rise, resulting in a concomitant increase in insulin secretion from the pancreas (Nakrani et al., 2022). The major effects of insulin on glucose metabolism are the following: it (a) increases glucose transport across the cell membrane in adipose tissue and muscle, (b) increases glycolysis in muscle and adipose tissue, (c) stimulates glycogenesis and inhibits glycogenolysis in muscle and liver, and (d) inhibits gluconeogenesis in the liver (Newsholme & Dimitriadis, 2001). A few hours after the meal, as the blood glucose concentration falls, glucagon is secreted to release glucose back into the blood, which dampens glucose fluctuations. 
The main goal of treatment, especially for T1DM patients, is to mimic this physiologic insulin secretion by providing appropriate basal and prandial insulin bolus doses. However, physical exercise often initiates a rise in insulin levels and could potentially cause dangerous hypoglycemia even when the infused insulin is carefully planned. This effect could be amplified in patients with impaired counter-regulation, and even a short episode of antecedent hypoglycemia may worsen exercise responses and hence subsequent hypoglycemia (Cockcroft et al., 2020; Davis et al., 2000; Hopkins, 2004). Patients with T1DM frequently suffer from complications associated with unstable glucose levels when blood glucose (BG) is not well controlled. According to the Diabetes Control and Complications Trial, better BG control leads to lower HbA1c levels, which in turn leads to better outcomes in terms of both microvascular and macrovascular complications (Nathan et al., 2014). Hence, achieving stable glucose levels has been a well-established goal in diabetes management.

Combining a glucose sensor, an insulin infusion device, and a control algorithm, artificial pancreas (AP) is a closed-loop system designed for patients with T1DM to improve their BG regulation, and consequently decrease the risk of diabetic complications (Beck et al., 2019; Cobelli et al., 2011; Ramkissoon et al., 2017; Weisman et al., 2017). While wearable minimally-invasive continuous glucose monitoring (CGM) sensors can provide real-time measurements of blood glucose concentration for days by measuring the glucose concentration in the interstitial fluid (Cappon et al., 2019), commercial products measuring plasma insulin are not widely accessible. The insulin pump, which is an infusion device, provides continuous insulin administration through subcutaneous infusion, instead of needle injections, and greatly improves the quality of life for patients with diabetes (Dwibedi et al., 2022; Ly et al., 2019; Pickup & Keen, 2002; Zhang et al., 2021). Generally, computational models for glucose prediction can be categorized into three different types, i.e., physiology-driven models which are carefully designed by clinicians describing the glucose consumption and production among multiple organs or tissues (Bergman et al., 1979; Hovorka et al., 2004; Man et al., 2014; Visentin et al., 2016), data-driven models which are constructed by training machine learning models based on data collected from glucose monitors (Deng et al., 2021; Lee et al., 2020), and a hybrid approach combining both of them (Woldaregay et al., 2019). In this work, we choose a hybrid model combining a physiology-driven ODE system and deep neural networks with a priority on inferring the hidden dynamics and hidden parameters to build patient-specific in silico glucose-insulin dynamics, in addition to glucose prediction. Specifically, considering the glucose reduction effect due to physical exercises using real-time motion sensors could improve glucose prediction and hence benefit glucose regulation. 
We were particularly interested in the straightforward and explicit incorporation of physical exercise from a motion sensor and hence chose to use the Roy-Parker model with such functionality, instead of those popular physiology-driven models that only focus on integrating CGM data and insulin pump data, despite their remarkable performance.

Therefore, the next-generation AP in precision medicine will benefit from a holistic design considering important features such as patient-specificity, physical exercises, meal intake. Fortunately, emerging efforts using artificial intelligence (Bent et al., 2021; Deng et al., 2021; Gordon & Stern, 2019; Zhang et al., 2022; Zhu et al., 2022), especially deep neural network-based data-driven machine learning algorithms, provide us with opportunities to predict glucose levels and characterize the complex glucose-insulin dynamics and further enhance the development of control algorithms like reinforcement learning (RL). Unlike supervised learning where a direct input and output mapping is given explicitly, online RL algorithms define the control problem as a Markov Decision Process (MDP) and train an agent, which collects a step-by-step reward through interactions with the environment of interest. The goal of the agent is to respond to the environment changes so that the total reward of the series of responses, namely actions, is maximized by the end of training. Generally, online RL requires a demanding setting where the agent is trained through trial-and-error interactions with a dynamic environment, which can be dangerous for glucose regulation tasks. During these trial-and-error steps, the RL agent may try to optimize for short-term rewards such as glucose levels and end up prescribing too high doses of insulin, which could lead to dangerous hypoglycemia and even unexpected hospitalization when one uses rapid insulin which has a short half-life. Recently, by interacting with in silico simulators, like UVA/Padova T1DMS, several studies developed insulin optimization models based on online RL algorithms for continuous action space, such as Soft Actor-Critic (Fox et al., 2020; Lim et al., 2021), Deep Deterministic Policy Gradient (Zhu et al., 2020), Normalized Advantage Function (Raheb et al., 2022), and Proximal Policy Optimization (Viroonluecha et al., 2022). 
In general, offline reinforcement learning algorithms are believed to be more efficient than online reinforcement learning because they avoid interaction with the environment, where data collection can often be expensive and risky (Fujimoto & Gu, 2021). Unlike online RL, offline RL algorithms train agents using only previously collected real-world data and require no additional interaction with the environment (Kaelbling et al., 1996; Levine et al., 2020), which minimizes the potential risk of harm to diabetic patients and provides a promising opportunity for healthcare challenges where automated drug infusion is necessary (Cai et al., 2023; Emerson et al., 2022). In this study, we prioritized safety in developing reinforcement learning algorithms for healthcare applications, specifically automated insulin delivery in type 1 diabetes; safety is one of the advantages of offline reinforcement learning algorithms. However, offline RL also has disadvantages, notably distributional shift: the function approximators are trained under one distribution but evaluated on a different one. To address this challenge, policy constraint methods for offline RL utilize either parameterization or regularization techniques (Wu et al., 2022). Among these approaches, batch-constrained Q learning (BCQ) imposes the policy constraint through parameterization, while the Twin Delayed Deep Deterministic Policy Gradient algorithm with behavior cloning (TD3+BC) achieves a similar outcome through straightforward regularization. In the present work, we apply these two representative offline RL algorithms to optimize the insulin dosage at the patient-specific level.

Motivated by the features highlighted for next-generation AP design, we propose a novel framework to design a patient-specific artificial pancreas using up-to-date hardware and software technologies with digital twins (Fig. 1). By simultaneously considering patient-specificity, meal intakes, insulin infusion, and most importantly physical exercise, we build and optimize patient-specific glucose levels using a system of ordinary differential equations (ODE) developed by Roy and Parker (Roy & Parker, 2007), time-dependent systems biology informed neural networks (SBINN) (Yazdani et al., 2020), wearable sensor data from the OhioT1DM dataset (Marling & Bunescu, 2020) and two offline RL algorithms. Specifically, when training the time-dependent SBINN, we implement self-adaptive weights to automatically adjust the coefficient of each loss term in the loss function (McClenny & Braga-Neto, 2020), and obtain the dynamics of hidden states and hidden parameters. We design the offline agents such that they learn the insulin dosage from only a short sequence of past glucose levels, without any meal or exercise announcements, which is a significant step towards an authentic closed-loop system for artificial pancreas design.

Methods

2.1. Framework of the study

In this work, we developed a computational framework to design a patient-specific automated insulin delivery system for six patients with type 1 diabetes using patient-specific data from the OhioT1DM dataset. The OhioT1DM dataset contains eight-week continuous glucose monitoring, insulin, physiological sensor, and self-reported life-event data for 12 patients with type 1 diabetes, among which 6 patients participated in the 2018 cohort and the other 6 patients in the 2020 cohort. A workflow of this framework is shown in Fig. 2A. We first implemented systems biology informed neural networks (SBINN) on the OhioT1DM dataset for parameter inference of the patient-specific Roy-Parker model (Roy & Parker, 2007), based on historical records of total exogenous insulin (bolus insulin and basal insulin), carbohydrate intakes, heart rate, and CGM measured glucose level. We then trained a deep offline RL neural network to build a patient-specific automated insulin delivery system for two representative patients in the OhioT1DM dataset. The final optimized agent, represented by deep neural networks, can serve as the patient-specific artificial pancreas, leading to a better insulin dosage scheme for the patient.

2.2. Dataset

Dataset overview.

All patients in the OhioT1DM dataset were on insulin pump therapy with continuous glucose monitoring (CGM) throughout the 8-week data collection period. Since the 8-week data was split into training data (the first 7 weeks) and testing data (the final 8th week) with no clear instruction to enable an accurate merge of the data, we only used the first 7 weeks of training data. Based on the form of the ODE model, i.e., the Roy-Parker model, we selected from the OhioT1DM dataset the following historical measurements: (1) the CGM blood glucose level, (2) insulin doses, both bolus insulin and basal insulin, (3) self-reported meal times and the amounts of carbohydrate intake, and (4) the heart rate. We note that only the data of the 6 patients in the 2018 cohort are used in our analysis, due to the lack of heart rate monitoring in the 2020 cohort.

Data preprocessing.

The exogenous insulin is calculated as the sum of basal insulin and bolus insulin at each moment. According to the user manuals of the insulin pumps, i.e., Medtronic 530G and 630G, the basal insulin $[eqn]$ is given at a rate and is provided explicitly in the electronic health record, whereas bolus insulin $[eqn]$ is a one-time dose and can be released into the blood stream using different modes, i.e., “normal”, “normal dual”, “square” and “square dual”. Additionally, we consider “temp basal” insulin $[eqn]$ , which overrides the basal insulin set previously. Given the limited information on the exact releasing process of these different modes in the OhioT1DM dataset, we assume a conversion formula based on the literature (Heinemann, 2009) and the user guides of the corresponding commercial insulin pumps (Medtronic Diabetes, 2016, 2018). The mathematical formula of the total insulin infusion rate $[eqn]$ is given as follows.

[eqn]

where $[eqn]$ denotes that the “temp basal” option is available at time $[eqn]$ . Depending on the mode, $[eqn]$ is computed as follows,

[eqn]

where $[eqn]$ denotes single-dose bolus insulin, 10 approximates the releasing time in minutes for bolus insulin in “normal” mode, and $[eqn]$ and $[eqn]$ denote the moments at which the corresponding mode ends and starts, respectively, both of which are given in the electronic health record. In “dual” mode, the bolus dose is evenly divided into two halves, which are released with the two modes sequentially.
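The conversion described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the argument layout, the even spreading of a “square” bolus between its start and end times, and the identical handling of the two “dual” modes are all assumptions, since the exact formulas are given only in the equations above.

```python
NORMAL_RELEASE_MIN = 10.0  # approximate release time of a "normal" bolus (from the text)


def bolus_rate(dose, mode, t, t_start, t_end):
    """Bolus infusion rate (dose units per minute) at minute t for one bolus record.

    Assumptions (hypothetical): a "normal" bolus is spread evenly over 10 min
    from t_start, a "square" bolus evenly over [t_start, t_end), and a "dual"
    bolus releases half the dose as "normal" and then half as "square".
    """
    def normal(d, start):
        return d / NORMAL_RELEASE_MIN if start <= t < start + NORMAL_RELEASE_MIN else 0.0

    def square(d, start):
        duration = t_end - start
        return d / duration if duration > 0 and start <= t < t_end else 0.0

    if mode == "normal":
        return normal(dose, t_start)
    if mode == "square":
        return square(dose, t_start)
    if mode in ("normal dual", "square dual"):
        half = dose / 2.0
        return normal(half, t_start) + square(half, t_start + NORMAL_RELEASE_MIN)
    raise ValueError(f"unknown bolus mode: {mode!r}")


def total_insulin_rate(basal, temp_basal, bolus):
    """Total infusion rate: a temp basal, when set, overrides the programmed basal."""
    base = temp_basal if temp_basal is not None else basal
    return base + bolus
```

For instance, under these assumptions a 5-unit “normal” bolus contributes 0.5 units/min for ten minutes and nothing afterwards.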

The glucose consumption rate $[eqn]$ due to meal intakes is computed by the amount of meal carbohydrate using an exponential decay function (Yazdani et al., 2020) as follows,

[eqn]

where $[eqn]$ grams of carbohydrate intake are recorded at $[eqn]$ , with the sum taken over the total number of meals, and the decay constant is derived from the glucose-insulin study case in Yazdani et al. (2020), based on the work of Sturis et al. (1991).
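As an illustration of the exponential-decay meal model, the sketch below sums one decaying contribution per recorded meal. The functional form and the default decay constant `k = 1/120` per minute are assumptions recalled from the glucose-insulin example of Yazdani et al. (2020); the equation above is authoritative, and the function name is hypothetical.

```python
import math


def meal_glucose_rate(t, meals, k=1.0 / 120.0):
    """Glucose rate at minute t from all meals eaten at or before t.

    meals: list of (t_j, m_j) pairs, meal time in minutes and carbohydrate amount.
    Each past meal contributes m_j * k * exp(k * (t_j - t)); future meals contribute 0.
    The default k is an assumed value, not taken from this paper.
    """
    return sum(m * k * math.exp(k * (t_j - t)) for t_j, m in meals if t_j <= t)
```

Each meal therefore produces a jump of size `m_j * k` at its recorded time, followed by a decay with time constant `1 / k`.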

Several studies have demonstrated the feasibility of using a target heart rate as a tool for exercise prescription (Karvonen & Vuorimaa, 1988; Porcari et al., 2015; Thomson et al., 2019). We specifically use heart rate collected from the fitness band to quantify the exercise intensity, represented by the percentage of $[eqn]$ with an empirical formula as follows (Lepretre et al., 2004),

[eqn]

where 8 denotes the average $[eqn]$ for a person at the basal state (Roy & Parker, 2007). For missing heart rate values, we apply data imputation using linear interpolation with adjacently available heart rates. After converting all time sequences into the same time resolution, we trim the time sequences of different sources such that the starting time stamp of the final data is the latest of all time sequences and the ending time stamp of the final data is the earliest of all time sequences. We also further smooth the data using a rolling window and generate a coarse-grained dataset, where the sampling interval is 1 h between neighboring time points. Fig. S2 shows the processed insulin infusion $[eqn]$ , carbohydrate intake $[eqn]$ and exercise intensity $[eqn]$ for 6 patients in the OhioT1DM dataset. Table S2 shows the preprocessed dataset for 6 patients.
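The imputation, trimming and smoothing steps described above can be sketched with the standard library alone. The helper names, the nearest-value filling of leading and trailing gaps, and the trailing (rather than centered) window are illustrative assumptions, not details taken from the paper.

```python
def interpolate_missing(values):
    """Fill None entries by linear interpolation between the nearest known neighbours.

    Gaps at either end are filled with the nearest known value (an assumption).
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no observed values to interpolate from")
    out = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((j for j in known if j < i), default=None)
        right = min((j for j in known if j > i), default=None)
        if left is None:
            out[i] = values[right]
        elif right is None:
            out[i] = values[left]
        else:
            w = (i - left) / (right - left)
            out[i] = values[left] + w * (values[right] - values[left])
    return out


def common_window(series):
    """Trim every series to [latest start, earliest end].

    series: dict mapping a source name to a time-sorted list of (timestamp, value).
    """
    start = max(s[0][0] for s in series.values())
    end = min(s[-1][0] for s in series.values())
    return {name: [(t, v) for t, v in s if start <= t <= end]
            for name, s in series.items()}


def rolling_mean(values, window):
    """Trailing moving average; the window is shorter for the first few points."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```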

2.3. Systems biology informed neural networks (SBINN) with the Roy-Parker model

Roy-Parker model.

With the aim of developing a robust closed-loop insulin delivery system under changing physiological conditions, Roy and Parker developed a model that can predict blood glucose levels at rest and during physical exercise (Roy & Parker, 2007). Since the patients who participated in OhioT1DM only performed sporadic and light physical activities, we modified the ODE system of Roy and Parker (2007) by omitting the $[eqn]$ term representing the decline of the glycogenolysis rate during prolonged exercise due to the depletion of liver glycogen stores (Fig. 2B). The resulting ODE system for the 6 state variables $[eqn]$ is shown in Eqs. (5)–(10),

[eqn]

[eqn]

[eqn]

[eqn]

[eqn]

[eqn]

This ODE system captures the exercise-induced dynamics of plasma insulin concentration $[eqn]$ , remote insulin concentration $[eqn]$ , the plasma glucose level $[eqn]$ , exercise-induced hepatic glucose production $[eqn]$ , exercise-induced glucose uptake $[eqn]$ , exercise-induced insulin removal from the circulatory system $[eqn]$ , exogenous infusion $[eqn]$ , and external glucose uptake $[eqn]$ . The instant parameters $[eqn]$ and $[eqn]$ represent the basal plasma insulin and glucose concentrations, respectively. Given the multi-scale nature of the glucose levels in a month-long observation, we found that our algorithm learns better the glucose dynamics when we allow the parameters in the ODE to vary over time. Table S1 shows the nomenclature, physiological meaning and reference values of patient-specific parameters to be inferred in the ODE. The ranges of the parameters were set to $[eqn]$ , where $[eqn]$ represents the corresponding reference value of the variable in Table S1.
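To make the structure of such a six-state system concrete, the sketch below codes a right-hand side in the general form of the Roy-Parker model as recalled from Roy & Parker (2007), with the glycogen-depletion term dropped. It is an assumption-laden illustration: the parameter names (`n`, `p1`–`p4`, `a1`–`a6`, `W`, `Vol_G`, `I_b`, `G_b`), the use of exercise intensity directly as an input, and the forward-Euler step are not taken from this paper, and Eqs. (5)–(10) remain the authoritative definition.

```python
def roy_parker_rhs(state, u_insulin, u_glucose, exercise, p):
    """Time derivatives of (I, X, G, G_prod, G_up, I_e).

    Assumed form only; `p` is a dict of hypothetical parameter names.
    """
    I, X, G, G_prod, G_up, I_e = state
    w = p["W"] / p["Vol_G"]  # body weight over glucose distribution volume
    dI = -p["n"] * I + p["p4"] * u_insulin - I_e          # plasma insulin
    dX = -p["p2"] * X + p["p3"] * (I - p["I_b"])          # remote insulin
    dG = (-p["p1"] * (G - p["G_b"]) - X * G               # plasma glucose
          + w * (G_prod - G_up) + u_glucose / p["Vol_G"])
    dG_prod = p["a1"] * exercise - p["a2"] * G_prod       # hepatic glucose production
    dG_up = p["a3"] * exercise - p["a4"] * G_up           # exercise-induced glucose uptake
    dI_e = p["a5"] * exercise - p["a6"] * I_e             # exercise-induced insulin removal
    return (dI, dX, dG, dG_prod, dG_up, dI_e)


def euler_step(state, derivs, dt):
    """One explicit Euler step; a stiff or adaptive solver may be preferable in practice."""
    return tuple(s + dt * d for s, d in zip(state, derivs))
```

In this form, a patient at rest with basal insulin infusion and no meal input sits at a steady state, and any positive exercise intensity drives the three exercise-related states away from zero.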

Systems biology informed neural networks (SBINN).

Yazdani et al. developed a general framework, namely systems biology informed neural networks (SBINN), to solve all states described by a system of ODEs as well as simultaneously estimating the parameters involved (Yazdani et al., 2020). Fig. 2C shows the structure of SBINN, which is sequentially composed of an input-scaling layer to allow input normalization for the robust performance of the neural networks, a feature layer marking different patterns of state variables in ODEs and the output-scaling layer to convert normalized state variables back to physical units. By effectively adding constraints derived from the ODE system to the optimization procedure, SBINN is able to simultaneously infer the dynamics of unobserved species, external forcing, and the unknown model parameters.

Given the measurements of $[eqn]$ at times $[eqn]$ , SBINN enforces the network to satisfy the ODE of interest at the time point $[eqn]$ . To solve an initial value problem or a final value problem for an ODE that encodes the physics, one needs to compute a solution that satisfies the initial and/or final values while minimizing the residual of the ODE. For data-driven approaches like physics-informed neural networks or SBINN, one additionally needs to minimize the difference between the observed data and the neural network used to approximate the state variable that generates the observed data. Hence, SBINN defines the total loss as a function of both the parameters of the neural networks, denoted by $[eqn]$ , and the parameters of the ODE, denoted by $[eqn]$ .

[eqn]

where $[eqn]$ is associated with the $[eqn]$ sets of observations of the state variables $[eqn]$ in the ODE to address the data-driven loss; $[eqn]$ represents the residual of the ODE to be minimized; and $[eqn]$ is defined to satisfy the initial value and/or final value. The final step of SBINN is to infer the neural network parameters $[eqn]$ as well as the unknown ODE parameters $[eqn]$ simultaneously by minimizing the aforementioned loss function via gradient-based optimizers (Kingma & Ba, 2014).

In this work, our ODE of interest is the modified Roy-Parker model presented above, and we instantiated the terms in the loss as follows. The known observed state variable $[eqn]$ is the CGM-measured glucose record in the OhioT1DM dataset, i.e., $[eqn]$ , which is used for minimizing the data loss, $[eqn]$ . We used $[eqn]$ to minimize the residual terms in the ODE, shown in Eqs. (5)–(10). Following Yazdani et al. (2020), we imposed the initial condition as the auxiliary loss $[eqn]$ . To improve the training of SBINN and speed up the convergence, we implemented self-adaptive weights, updated at each iteration, for the loss terms (McClenny & Braga-Neto, 2020). The self-adaptive weighted loss $[eqn]$ is as follows,

[eqn]

where $[eqn]$ and $[eqn]$ are trainable, non-negative self-adaptation weights associated with the ODE loss term and the auxiliary loss term, respectively. Hence, the objective of the training of the neural networks is updated to

[eqn]

The update rules for the self-adaptive weights are given by,

[eqn]

where $[eqn]$ and $[eqn]$ denote the learning rates for the corresponding weights, which were set to $[eqn]$ in this work.
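The self-adaptive weighting scheme can be illustrated with scalar loss values. The sketch assumes that each weight multiplies its loss term linearly and that the weights are updated by gradient ascent while the network and ODE parameters are updated by gradient descent, which is the saddle-point idea described by McClenny & Braga-Neto (2020); the function names and the linear form are assumptions rather than the paper's exact equations.

```python
def weighted_loss(l_data, l_ode, l_aux, lam_ode, lam_aux):
    """Total loss: data loss plus self-adaptively weighted ODE and auxiliary losses."""
    return l_data + lam_ode * l_ode + lam_aux * l_aux


def update_weights(lam_ode, lam_aux, l_ode, l_aux, lr_ode, lr_aux):
    """One gradient-ascent step on the weights.

    With linear weighting, d(total loss)/d(lam) is the loss term itself, so a
    weight grows fastest for the term that is currently largest. Because the
    loss terms are non-negative, the weights stay non-negative.
    """
    return lam_ode + lr_ode * l_ode, lam_aux + lr_aux * l_aux
```

In a training loop, `update_weights` would be called once per iteration, after the step on the network and ODE parameters.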

2.4. Offline reinforcement learning

We formulated the RL problem for glucose regulation as an MDP as follows. The state $[eqn]$ is defined by a continuous sequence of glucose levels in a time window $[eqn]$ , i.e., $[eqn]$ . Specifically, we used $[eqn]$ to allow the model to learn enough glucose dynamics. The action $[eqn]$ is defined as the insulin infusion rate at time $[eqn]$ , i.e., $[eqn]$ . We fixed the glucose infusion rate $[eqn]$ and exercise intensity $[eqn]$ from the OhioT1DM dataset and trained the agent to provide the optimal exogenous insulin infusion rate $[eqn]$ for a month. The return function $[eqn]$ and the reward function $[eqn]$ of a complete episode are defined as follows,

[eqn]

[eqn]

where we set the target glucose level to be 120 mg/dl, $[eqn]$ . In our code implementation, we imposed a constraint on the maximum insulin infusion rate of 90 mU/min (derived from the highest observed infusion rate among these six patients in the OhioT1DM dataset) to avoid hypoglycemia induced by insulin overdosage regardless of the weight of the patients. We implemented two offline RL algorithms, i.e., BCQ and TD3+BC. Both RL algorithms were tested on an NVIDIA RTX A6000 GPU for 10000 episodes, taking about 24 h of wall-clock time. In both methods, the optimal hyperparameter sets were determined by grid search.
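The MDP ingredients above can be sketched as small helper functions. The 120 mg/dl target and the 90 mU/min cap come from the text; the window length is left as a parameter because its value is not shown here, and the reward used below (negative absolute deviation from the target) is a hypothetical stand-in for the paper's reward function, not a reproduction of it.

```python
TARGET_GLUCOSE = 120.0    # mg/dl, from the text
MAX_INSULIN_RATE = 90.0   # mU/min, from the text


def make_state(glucose, t, window):
    """State at index t: the last `window` glucose readings up to and including t."""
    if t + 1 < window:
        raise ValueError("not enough history to build the state")
    return tuple(glucose[t - window + 1: t + 1])


def clip_action(rate):
    """Constrain the proposed insulin infusion rate to [0, MAX_INSULIN_RATE]."""
    return min(max(rate, 0.0), MAX_INSULIN_RATE)


def reward(glucose_next):
    """Hypothetical reward: penalize distance from the target glucose level."""
    return -abs(glucose_next - TARGET_GLUCOSE)


def episode_return(rewards, gamma=1.0):
    """Discounted sum of step rewards (gamma=1.0 gives the plain sum)."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```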

Batch-constrained Q learning (BCQ).

Herein, we implemented the batch-constrained Q-learning (BCQ) algorithm (Fujimoto et al., 2019) to optimize personalized insulin dosage for each patient from OhioT1DM (Fig. 2D). In BCQ, a buffer dataset is first collected by some behavior policy $[eqn]$ before the training starts. Specifically, we generated a buffer by sampling from the OhioT1DM dataset and generating the states, i.e., glucose levels, carbohydrate intakes and physical exercises, along with the action, denoted by the total exogenous insulin at a specific time point, i.e., the sum of the bolus insulin and basal insulin, and the corresponding returns depending on the resulting glucose levels. Afterwards, the agent, represented by a deep neural network, is trained with the RL algorithm (Fujimoto et al., 2019). Finally, a policy outperforming the behavioral policy $[eqn]$ is deployed in the patient-specific artificial pancreas. By restricting the action space in order to force the agent towards behaving close to a subset of the given data, BCQ accounts for extrapolation error and is able to learn successfully without interacting with the environment. The details of the BCQ algorithm are shown in Algorithm 1. Following Fujimoto et al. (2019), we implemented fully-connected feed-forward neural networks for the Q-networks, which represent the agent’s action, i.e., the insulin dosage, and a variational autoencoder (VAE), which is defined by two networks, an encoder network $[eqn]$ and a decoder network $[eqn]$ , where $[eqn]$ is the latent vector, to obtain the consequent glucose response. The hyper-parameters used in this work can be found in Table S3.
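The way BCQ restricts the action space can be illustrated by its action-selection step as described by Fujimoto et al. (2019): candidate actions are sampled from a generative model fitted to the buffer, each is nudged by a bounded perturbation, and the candidate with the highest Q-value is chosen. In the sketch below the three models are passed in as plain callables; these stand-ins, the argument names and the default values are hypothetical, and Algorithm 1 remains the reference.

```python
def bcq_select_action(state, generator, perturbation, q_value,
                      n_candidates=10, max_perturbation=0.05):
    """Pick the best of several buffer-like candidate actions.

    generator(state)        -> action sampled near the behaviour data (e.g. a VAE decoder)
    perturbation(state, a)  -> proposed adjustment, clipped to +/- max_perturbation
    q_value(state, a)       -> estimated value of taking action a in this state
    """
    candidates = []
    for _ in range(n_candidates):
        action = generator(state)
        shift = perturbation(state, action)
        shift = max(-max_perturbation, min(max_perturbation, shift))
        candidates.append(action + shift)
    return max(candidates, key=lambda a: q_value(state, a))
```

Because every candidate stays within a bounded distance of an action the generator considers plausible, the Q-network is rarely queried on actions far from the logged data.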

TD3+BC.

We also tested the performance of another offline reinforcement learning algorithm, i.e., TD3+BC (Fujimoto & Gu, 2021), which combines the Twin Delayed Deep Deterministic policy gradient (TD3) algorithm (Fujimoto et al., 2018) with behavior cloning (BC). Based on Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015), TD3 improves performance by (1) learning two Q-values and using the smaller of the two to form the targets in the Bellman error loss functions, (2) updating the policy and target networks less frequently than the Q-values, and (3) adding noise to the target action, which smooths Q along changes in action and prevents the policy from exploiting Q-value errors. TD3+BC adds a single adjustment, a behavior cloning term, to the policy update of the TD3 algorithm, which yields the following policy update rule,

[eqn]

where $[eqn]$ denotes the RL policy, $[eqn]$ denotes the distribution of state $[eqn]$ and action $[eqn]$ pairs from the offline buffer, and $[eqn]$ is a hyperparameter. The details of the TD3+BC algorithm are shown in Algorithm 2. For offline buffer generation, we reused the buffer for each patient generated in BCQ. For the neural network architectures of the actors and critics, we followed the implementation of Fujimoto et al. (2018). The hyper-parameters and architecture details used in this work can be found in Table S4.
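The two ingredients named above, the clipped double-Q target of TD3 and the behavior-cloning term of TD3+BC, can be sketched on plain Python lists. The normalization of the Q-term by the mean absolute Q-value and the default `alpha = 2.5` follow the description in Fujimoto & Gu (2021) as recalled here; whether this work uses the same normalization or value is an assumption (Table S4 lists the actual settings), and the function names are hypothetical.

```python
def td3_target(reward, q1_next, q2_next, gamma=0.99, done=False):
    """Bellman target built from the smaller of the two target-critic estimates."""
    return reward + (0.0 if done else gamma * min(q1_next, q2_next))


def td3_bc_policy_objective(q_pi, pi_actions, data_actions, alpha=2.5):
    """Batch average of lam * Q(s, pi(s)) - (pi(s) - a)^2, to be maximized.

    q_pi:          Q-values of the policy's actions for each state in the batch
    pi_actions:    scalar actions proposed by the policy
    data_actions:  scalar actions stored in the offline buffer
    lam rescales the Q-term by the mean absolute Q-value so that the two terms
    stay on a comparable scale.
    """
    mean_abs_q = sum(abs(q) for q in q_pi) / len(q_pi)
    lam = alpha / mean_abs_q
    terms = [lam * q - (pa - da) ** 2
             for q, pa, da in zip(q_pi, pi_actions, data_actions)]
    return sum(terms) / len(terms)
```

A larger `alpha` puts more weight on maximizing Q, while a smaller one keeps the policy closer to the logged insulin doses.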

Results

To build a surrogate environment for our agent to interact with, we first performed patient-specific parameter inference using SBINN with the system of ODEs developed by Roy and Parker (2007). The primary step of parameter inference for a system of ODEs is to examine its identifiability. After obtaining the patient-specific parameters of the ODEs, which are essential to reconstruct the dynamics of the state variables, we developed a patient-specific offline RL algorithm to learn an optimal plan for the external insulin infusion for two representative patients, which helped them decrease the risk of hypoglycemia and hyperglycemia, respectively.

ODE identifiability of the Roy-Parker model

We first performed structural identification on the ODE for the set of parameters

[eqn]

that appear in Eqs. (5)–(10). Although most of the ODE parameters are not readily available in the OhioT1DM dataset, there are several parameters that are practically available for patients. For example, the patient’s body weight ( $[eqn]$ ) can be measured in practice but is not available in OhioT1DM. Similarly, $[eqn]$ , the exogenous insulin infusion rate needed to maintain basal plasma insulin, is not available in OhioT1DM, but it could be inferred from the mode of the exogenous insulin profile of the specific patient. We considered the identifiability of the aforementioned parameters under different scenarios depending on whether $[eqn]$ and $[eqn]$ are known for the given patient. Table 1 suggests that when $[eqn]$ and $[eqn]$ are both known, the other parameters are either globally identifiable or locally identifiable. We adopted this scenario in the following analysis by assuming $[eqn]$ = 60 kg and a patient-specific $[eqn]$ to construct the patient-specific model by inferring a subset of the aforementioned parameters $[eqn]$ for a one-month period.

Patient-specific parameter inference and kinetics reconstruction using OhioT1DM

In the OhioT1DM dataset, we assumed the patient’s weight is 60 kg, i.e., $[eqn]$ = 60 kg. We note that the weight parameter can be absorbed in $[eqn]$ , which will remain within a reasonable range after rescaling to accommodate a higher body weight (Ferrannini et al., 1985). We also assume that the initial condition of the ODE system is given by $[eqn]$ , where $[eqn]$ is estimated from the basal insulin rate and $[eqn]$ is the initial blood glucose level, both of which are patient-specific estimations from the OhioT1DM dataset. We imposed the initial condition of the state variables as the auxiliary loss. We also imposed smoothing using a moving window of 30 data points on the model inputs to speed up the convergence of parameter inference.

To confirm the accuracy of the inferred parameters, we plotted the kinetics of both the CGM-measured BG and the BG obtained from the solution of the system of ODEs. Additionally, we examined the performance of our model with Clarke Error Grid analysis (Clarke et al., 1987), which describes the clinical accuracy of models over the entire range of blood glucose values with clearly labeled domains implying clinical decisions. Fig. 3 shows the Clarke error grid analysis of time-dependent SBINN predicted BG levels, i.e., glucose levels obtained from solving the ODE system using the parameters inferred by SBINN vs. the sensor data collected in OhioT1DM for all six patients. We found that most of the prediction-reference BG pairs for all patients lie in regions A and B, which are helpful for appropriate treatment. Specifically, the time-dependent SBINN method showed high accuracy in five patients (Fig. 3 (A)–(E)), for whom most prediction-reference BG pairs lie in region A and are considered clinically accurate. Although a few prediction-reference BG pairs for patient 563 lie in region C, we note that they are mostly in the upper triangle of the grid, which means the predicted BG levels are slightly higher than the reference BG levels, and are not likely to lead to insulin overdose or a hypoglycemic event.
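As a small illustration of the Clarke Error Grid criterion, the sketch below checks only region A, using the commonly stated rule that a pair is clinically accurate when the prediction is within 20% of the reference or when both values are below 70 mg/dl. Regions B to E have more intricate boundaries and are deliberately not implemented; the function names are hypothetical and the boundary handling (strict vs. non-strict inequalities) varies between implementations.

```python
def in_zone_a(reference, predicted):
    """True if a (reference, predicted) BG pair in mg/dl falls in Clarke region A."""
    both_hypoglycemic = reference < 70 and predicted < 70
    within_20_percent = abs(predicted - reference) <= 0.2 * reference
    return both_hypoglycemic or within_20_percent


def zone_a_fraction(pairs):
    """Fraction of (reference, predicted) pairs that fall in region A."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no pairs given")
    return sum(in_zone_a(r, p) for r, p in pairs) / len(pairs)
```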

Besides accurately predicting glucose levels over time, as typical data-driven glucose prediction algorithms do, we aim to infer the hidden parameters governing the ODE system by integrating SBINN with data from wearable sensors, and consequently to construct a patient-specific in silico model of glucose-insulin dynamics for subsequent training of offline reinforcement learning agents. To determine whether the hidden parameters are correctly inferred, we solve the initial value problem and compare the resulting BG with the BG measured by CGM. Fig. 4 shows the inference of model parameters and hidden kinetics on patient ID 588, who experienced repeated hyperglycemic events during the data collection period. Fig. 5 shows the same inference on patient ID 591, who experienced a short period of hypoglycemia around December 22, 2021 (glucose level below 80 mg/dl; we adjusted the hypoglycemia threshold because CGM tends to overestimate glucose levels (Farrell et al., 2020)). The corresponding results for the other four patients (IDs: 559, 563, 570, 575) can be found in Fig. S3–S6 in the Supplementary Material. Note that the time stamps in the OhioT1DM dataset are pre-processed to avoid privacy leakage; hence they do not represent the real collection time. Inspired by the observed fluctuation of metabolic reaction rates in the oral glucose tolerance test (OGTT) (Yoshino et al., 2022), we adopted a time-varying parameter setting for SBINN to improve the flexibility of our model. Interestingly, while patient 588 and patient 591 followed similar insulin infusion and carbohydrate intake patterns, the only lifestyle difference, exercise intensity, seemingly changed the outcomes of their glucose management (Fig. S2).
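The validation step described above (solve the initial value problem with the inferred parameters, then compare the simulated BG with CGM readings) can be sketched as follows. This is only an illustration under strong assumptions: it uses a simplified Bergman-style minimal model with three states (glucose G, remote insulin X, plasma insulin I) as a stand-in for the paper's exercise-aware ODE system, a hand-written RK4 integrator, and hypothetical constant parameter values in `PARAMS` that were not inferred from any patient.

```python
import math

# Hypothetical, constant parameter values for illustration only.
PARAMS = {"p1": 0.03, "p2": 0.02, "p3": 1e-5, "n": 0.15, "Gb": 100.0, "Ib": 10.0}


def minimal_model_rhs(t, state, params, insulin_input, meal_input):
    """Right-hand side of a simplified Bergman-style minimal model."""
    G, X, I = state
    dG = -params["p1"] * (G - params["Gb"]) - X * G + meal_input(t)
    dX = -params["p2"] * X + params["p3"] * (I - params["Ib"])
    dI = -params["n"] * (I - params["Ib"]) + insulin_input(t)
    return (dG, dX, dI)


def rk4_solve(rhs, state0, t_end, dt):
    """Integrate state' = rhs(t, state) from t = 0 to t_end with classical RK4."""
    steps = int(round(t_end / dt))
    t, state = 0.0, tuple(state0)
    trajectory = [(t, state)]
    for _ in range(steps):
        k1 = rhs(t, state)
        k2 = rhs(t + dt / 2, tuple(s + dt / 2 * k for s, k in zip(state, k1)))
        k3 = rhs(t + dt / 2, tuple(s + dt / 2 * k for s, k in zip(state, k2)))
        k4 = rhs(t + dt, tuple(s + dt * k for s, k in zip(state, k3)))
        state = tuple(
            s + dt / 6 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)
        )
        t += dt
        trajectory.append((t, state))
    return trajectory


def simulate_glucose(params, g0, t_end=60.0, dt=1.0,
                     insulin_input=lambda t: 0.0, meal_input=lambda t: 0.0):
    """Return the simulated glucose trajectory (mg/dl), one value per time step."""
    def rhs(t, s):
        return minimal_model_rhs(t, s, params, insulin_input, meal_input)
    trajectory = rk4_solve(rhs, (g0, 0.0, params["Ib"]), t_end, dt)
    return [state[0] for _, state in trajectory]


def rmse(predicted, measured):
    """Root-mean-square error between simulated and measured glucose."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(
        sum((p - m) ** 2 for p, m in zip(predicted, measured)) / len(predicted)
    )
```

A small `rmse(simulate_glucose(params, g0), cgm_readings)` would indicate that the inferred parameters reproduce the measured trajectory; in the paper's setting the parameters are time-varying and the inputs include exercise intensity, which this sketch does not model.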

We found frequently elevated plasma insulin and remote insulin in patient 591 compared to patient 588 (Fig. 4A and Fig. 5A), suggesting that patient 591 may have a higher risk of developing hyperinsulinemia than patient 588. We also observed that some of the hidden parameters of patient 588 do not fluctuate as significantly over time as those of patient 591 (Fig. 4B and Fig. 5B). These parameters are $[eqn]$ , denoting the rate of insulin addition into the plasma from exogenous insulin, $[eqn]$ , denoting the rate of exercise-induced hepatic glucose production, $[eqn]$ , denoting the rate of exercise-induced glucose uptake, and $[eqn]$ , denoting the rate of exercise-induced plasma insulin depletion during the recovery period. In particular, we observed a sudden increase in $[eqn]$ for patient 591 right after the occurrence of exercise-induced hypoglycemia, shortly before Jan 01. We also found that some parameters are, on average, significantly different between patient 591 and patient 588. These parameters are $[eqn]$ , denoting the rate of plasma insulin clearance, $[eqn]$ , denoting the rate of insulin addition in the remote insulin compartment, and $[eqn]$ , denoting the rate of insulin addition into the plasma from exogenous insulin. Interestingly, all these parameters point to the balance of remote insulin and plasma insulin, with $[eqn]$ and $[eqn]$ almost doubled in patient 591, while $[eqn]$ is frequently higher in patient 588. Together, these findings imply that glucose level fluctuation is a complex process involving multiple organs and tissues, and that exercise contributes to the occurrence of hypoglycemia in a more complicated way than merely lowering the plasma glucose level.

Our results suggest that the time-dependent SBINN successfully infers the fluctuating hidden kinetics as well as the parameters in all six patients, despite the inter-patient variability due to different daily routines of physical activity and insulin injection. More importantly, we also observed that the time-dependent SBINN performed robustly under 5 different random seeds, with the uncertainty band confined to an acceptable range. We specifically emphasize the accuracy of our parameter inference, which helped us reconstruct the dynamics by solving a forward problem with high accuracy, as indicated by the good match between the pink curves (forward solver) and the blue curves (real data collection).
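One simple way to summarize runs under several random seeds is a pointwise mean with a spread around it. The sketch below assumes a band of one population standard deviation around the mean; the source does not state how its uncertainty band is defined, so this choice and the name `uncertainty_band` are assumptions.

```python
from statistics import mean, pstdev


def uncertainty_band(runs):
    """Pointwise (lower, mean, upper) band across runs.

    `runs` is a list of equal-length trajectories, one per random seed.
    The band is mean +/- one population standard deviation (an assumed
    convention, not necessarily the one used in the paper).
    """
    if not runs or len({len(r) for r in runs}) != 1:
        raise ValueError("runs must be non-empty and of equal length")
    band = []
    for values in zip(*runs):
        m = mean(values)
        s = pstdev(values)
        band.append((m - s, m, m + s))
    return band
```

With five seeds, `runs` would hold five inferred trajectories of the same parameter or state variable, and a narrow band would indicate that the inference is insensitive to initialization.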

Offline reinforcement learning

We systematically compared two offline RL algorithms, i.e., BCQ and TD3+BC, on patients in the OhioT1DM 2018 cohort. The glucose trajectories in the original OhioT1DM dataset suggest that most patients experienced recurrent hyperglycemic events, while only one of the patients showed a short period of hypoglycemia (PID 591). Informed by the medical knowledge that higher insulin infusion can decrease the resulting glucose level, we broadened the permitted amount of insulin infusion for RL agents by doubling the maximum action (insulin infusion rate) documented in the OhioT1DM dataset, without exceeding the clinically safe insulin dosage (Braithwaite et al., 2020). We trained the offline RL models for 10,000 episodes and saved the best episode (defined as the episode in which the highest reward was achieved for each algorithm and patient) for later evaluation (Fig. S7). Fig. 6 shows that both BCQ and TD3+BC RL agents provide much better insulin dosage plans for 5 patients, as indicated by the green (BCQ) and blue (TD3+BC) curves staying more frequently in the safe glucose region (BG level between 80 mg/dl and 180 mg/dl) denoted by the green shade, and leaning closer to the target glucose level (120 mg/dl) used to define the maximum reward of action. We also examined the time in range (Table S6), time above range (Table S7) and time below range (Table S8), corresponding to the frequency of glucose levels between 80 mg/dl and 180 mg/dl, above 180 mg/dl, and below 80 mg/dl, respectively, for the three methods (the original data, BCQ, and TD3+BC). The results suggested that both BCQ and TD3+BC improved time in range in six patients, with TD3+BC showing slightly higher values. Consequently, the other two metrics, i.e., time above range and time below range, are lower in the offline models when compared to the data collected.
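The three range metrics can be computed directly from a glucose trajectory. The sketch below uses the 80 mg/dl and 180 mg/dl thresholds stated above and assumes evenly sampled readings, so that the fraction of samples equals the fraction of time; the function name and the returned keys are hypothetical.

```python
def time_in_ranges(glucose, low=80.0, high=180.0):
    """Return fractions of readings in range, above range, and below range.

    Assumes evenly spaced readings (mg/dl), so sample fractions stand in
    for time fractions.
    """
    n = len(glucose)
    if n == 0:
        raise ValueError("need at least one glucose reading")
    below = sum(1 for g in glucose if g < low)
    above = sum(1 for g in glucose if g > high)
    in_range = n - below - above
    return {"tir": in_range / n, "tar": above / n, "tbr": below / n}
```

Applying `time_in_ranges` to the recorded trajectory and to each agent's simulated trajectory gives the kind of comparison reported in Tables S6 to S8.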

Interestingly, the statistical analysis of returns among the three methods on all six patients (Fig. S8) suggests that both offline RL agents are significantly better than the original offline data ( $[eqn]$ -value < 0.05), and that there is no significant difference between BCQ agents and TD3+BC agents. In addition, we notice that the optimized insulin trajectories in BCQ seem noisier than those in TD3+BC because the VAE in BCQ reconstructs the variability of insulin infusion from the original offline dataset. While TD3+BC also imitates the distribution of actions in the offline dataset with a behavior cloning term, TD3+BC supports tuning the strength of behavior cloning with a hyperparameter $[eqn]$ , i.e., a higher $[eqn]$ favors RL and a lower value favors imitation, which was optimized during our training for patients. Although the original neural network architectures presented in TD3+BC and BCQ could potentially provide a better insulin plan compared to the offline dataset, we noted that increasing the depth of the actor network in TD3+BC and expanding the latent space dimension of the VAE in BCQ resulted in better glucose levels in patients 563 and 575. This is probably because the non-linearity of glucose-insulin dynamics is more pronounced in these patients, given that the offline RL training does not require meal announcement or exercise announcement. We also found it beneficial to add a weighting factor $[eqn]$ (Algorithm 1) to the Kullback–Leibler divergence loss when training the VAE in BCQ, since it contributed to faster convergence in some patients. This may imply that the appropriate balance between the reconstruction loss and the Kullback–Leibler divergence loss varies when training with data from different patients.
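The two tuning knobs discussed above can be written out explicitly. The sketch below follows the published TD3+BC actor objective (a Q-value term normalized by the mean absolute Q-value, plus a mean-squared behavior cloning term) and a standard Gaussian VAE loss with a weighted KL term. It is a scalar, framework-free illustration for a one-dimensional action and latent variable; the names `alpha` and `kl_weight` and their default values are placeholders for the elided symbols and are not taken from the paper.

```python
import math


def td3_bc_actor_loss(q_values, policy_actions, dataset_actions, alpha=2.5):
    """TD3+BC-style actor loss for scalar actions.

    loss = -lambda * mean(Q) + mean((pi(s) - a)^2), lambda = alpha / mean(|Q|).
    A larger alpha weights the RL term more; a smaller alpha favors imitation.
    """
    n = len(q_values)
    if n == 0 or n != len(policy_actions) or n != len(dataset_actions):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_abs_q = sum(abs(q) for q in q_values) / n
    if mean_abs_q == 0:
        raise ValueError("mean |Q| is zero; normalization is undefined")
    lam = alpha / mean_abs_q
    rl_term = sum(q_values) / n
    bc_term = sum((p - a) ** 2 for p, a in zip(policy_actions, dataset_actions)) / n
    return -lam * rl_term + bc_term


def weighted_vae_loss(actions, reconstructions, means, log_vars, kl_weight=0.5):
    """Reconstruction loss plus weighted KL divergence to a standard normal.

    Uses a one-dimensional Gaussian latent per sample:
    KL = -0.5 * (1 + log_var - mean^2 - exp(log_var)).
    """
    n = len(actions)
    if n == 0 or not (n == len(reconstructions) == len(means) == len(log_vars)):
        raise ValueError("inputs must be non-empty and of equal length")
    recon = sum((a - r) ** 2 for a, r in zip(actions, reconstructions)) / n
    kl = sum(-0.5 * (1 + lv - m ** 2 - math.exp(lv))
             for m, lv in zip(means, log_vars)) / n
    return recon + kl_weight * kl
```

Setting `alpha=0` reduces the actor loss to pure behavior cloning, while raising `kl_weight` pushes the VAE latent distribution closer to the standard normal prior at the expense of reconstruction accuracy.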

We also examined quantitative patterns of insulin infusion vs. glucose levels based on these three methods. The top panels in Fig. 7 suggest that the BCQ agents and TD3+BC agents successfully shift the glucose level distribution towards the defined safe range (BG level between 80 mg/dl and 180 mg/dl). Additionally, the right panels in Fig. 7 indicate that both the BCQ agents and TD3+BC agents learn to increase the overall insulin levels in patients with frequent hyperglycemia and to lower insulin infusion in the patient with some hypoglycemic events (Fig. 7(B)). It is noteworthy that offline RL agents exhibit greater synchronization between insulin administration and glucose levels, i.e., linear trends in the scatter plots of insulin infusion vs. glucose levels in most patients, while the original offline dataset does not show such a trend. In addition, even though the offline RL agents have considerable freedom in choosing actions, the optimal agent consistently employs a strategy in which the insulin dosage seldom reaches the maximum dose.
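One way to quantify the "linear trend" between insulin infusion and glucose level is the Pearson correlation coefficient of the paired samples. The source reports this trend only qualitatively, so the sketch below is an assumed illustration, not the authors' analysis; the name `pearson_r` is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)
```

A value of `pearson_r(glucose, insulin)` close to 1 would correspond to an agent that raises insulin infusion as glucose rises, whereas a value near 0 would match the lack of trend described for the original offline data.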

Discussion

Insulin is the mainstay of treatment for patients with type 1 diabetes mellitus and, oftentimes, long-standing type 2 diabetes mellitus to achieve good glycemic control (Shah et al., 2014). Overestimating the necessary insulin dosage can be extremely dangerous and may lead to hypoglycemia, i.e., potentially fatal blood glucose levels below 70–80 mg/dl when measured by CGM, whereas underestimating the insulin dosage, which leaves blood glucose above 180 mg/dl, may result in hyperglycemia, which is believed to be responsible for micro- and macro-vascular diseases in the long run. In modern medicine, the use of insulin pumps along with continuous glucose monitors has made glycemic control easier, but it requires significant resources and patient education. A closed-loop control system, also called an artificial pancreas (AP), automates insulin infusion to maintain a consistently stable blood glucose level, and can thereby relieve the burden on both patients and doctors and reduce medical costs.

Specifically, we attempted to address a few challenges in designing next-generation APs with our framework, which combines three key components to build a patient-specific artificial pancreas while simultaneously considering real-world data collected from wearable devices, i.e., meal intake, insulin infusion, and physical exercise. These components are: (1) a real-world historical medical dataset, namely the OhioT1DM dataset, containing patient-specific glucose, insulin, meal intake, and exercise intensity; (2) a flexible ODE model defining glucose-insulin dynamics that prioritizes two significant external factors affecting glucose levels, i.e., meal intake and physical exercise intensity; and (3) two offline RL algorithms that are trained without directly interacting with real patients’ metabolic environment. According to the Centers for Disease Control and Prevention, diabetic patients are advised to evaluate their treatment goals every 3 or 6 months in clinical visits. To ensure broad applicability in the evaluation of diabetes management for patients, we designed this framework to operate on a monthly time scale without sacrificing its generalizability.

Despite the variations arising from distinct daily physical activity and insulin injection patterns among six different individuals, our model not only correctly predicts the hidden states that cannot be measured with current diabetes technology, but also accurately infers the parameters governing the patient-specific Roy-Parker model. In addition, we noted the consistent performance of the time-dependent SBINN across distinct runs. These findings collectively serve as evidence for the generalizability of our framework. More importantly, the uncertainty-quantified hidden parameters provide patient-specific clinical interpretations of how patients’ behavioral patterns shape their corresponding glucose-insulin dynamics. To design a safe artificial pancreas, we focused on developing offline reinforcement learning algorithms, which avoid dangerous data collection. Furthermore, we trained offline RL agents with two different offline RL algorithms, i.e., BCQ and TD3+BC, to automate insulin infusion and optimize the performance of the agent on the patient-specific ODE with the same glucose uptake and exercise intensity over time. Our results suggest that our offline RL agent performs better at maintaining blood glucose levels within the safe range than the self-operated insulin infusion by the patients themselves. This implies that an agent trained by offline RL could learn a better insulin dosage depending solely on past glucose level sequences, without meal or exercise announcements. This design could decouple the learning of physiological parameters by SBINN from the training of an optimal insulin agent by offline RL, hence leading to a cost-effective AP.

In spite of the improved glycemic control provided by our offline agent and the minimal human intervention demanded by our framework, we can still identify possible improvements to the proposed framework, considering ODE model development, disease characterization, and data processing. Because there is believed to be a time delay in the effect of insulin on both glucose production and glucose utilization (Sturis et al., 1991), it may be helpful to modify the ODE model to address these sluggish effects. Additionally, insulin can be categorized into fast-acting, intermediate-acting, and long-acting based on the timing of its action in the body. To account for this variability, we could extend the data preprocessing of external insulin and update the ODE model to allow for variations in insulin types (Evans et al., 2019; Zijlstra et al., 2018). To further enhance closed-loop systems, it may be worthwhile to explore incorporating glucagon, a hormone that stimulates glucose production and can therefore increase plasma glucose levels, as an additional dimension of the action (Peters & Haidar, 2018). By adding glucagon as an additional action, the RL agent will be able to explore the insulin action space with fewer constraints. In addition, glucose-insulin dynamics is a complex and multi-scale process affected by other external factors, such as body weight variation, mental health (Anderson et al., 2001), and drug-drug interaction (Triplitt, 2006). Due to missing body-weight information in the dataset, all patient-specific models assumed a nominal weight of 60 kg. Although this parameter can be absorbed into the glucose distribution volume, it remains a simplification that may underestimate inter-individual physiological variability. Future work will incorporate actual patient anthropometric data (e.g., body weight, BMI) to further enhance personalization and physiological fidelity.
Fortunately, the proposed framework allows adaptation of the ODE form and the action space, hence enabling the incorporation of more external factors as forcing terms in the ODE and the extension of existing actions in offline RL agents. The accurate characterization of glucose afforded by this study at a monthly scale provides valuable insights for its potential adaptation to a smaller time scale, such as weekly, particularly for patients whose disease progresses at a faster rate than usual. We also note that systems biology-informed neural networks provide a promising opportunity to infer hidden dynamics in a biological system of interest, by leveraging the high expressivity of neural networks and primitive characteristics of the system in a low data availability scenario (Raissi et al., 2017). In the context of healthcare, however, further multi-center clinical trials involving more participants will greatly improve the robustness of our framework. Another notable limitation of the current dataset is the imbalance between hyperglycemic and hypoglycemic events. Although one of the key motivations of this work is to improve the prevention of exercise-induced hypoglycemia, clinically significant hypoglycemia occurred predominantly in a single patient (PID 591), while the majority of participants exhibited mainly hyperglycemic excursions. As a result, the present study provides stronger validation for hyperglycemia management than for hypoglycemia prevention. Future studies incorporating datasets with a broader distribution of hypoglycemic episodes, particularly during and after physical activity, will be essential for fully evaluating and refining the framework’s ability to anticipate and prevent low-glucose events.

In summary, by integrating real-world data from wearable sensors, time-dependent systems biology-informed neural networks, and deep offline RL algorithms, we have developed a universal framework that could shed light on patient-specific digital twin designs where adequate biological numerical models are established and enough medical data from sensors are available. In the context of type 1 diabetes, we found that SBINN could successfully infer the hidden physiological parameters governing glucose-insulin dynamics and accurately reconstruct the corresponding glucose trajectories at the patient level. In terms of deployability, both the SBINN and the offline RL controller operate in inference-only mode and therefore have low computational overhead. Inference can be performed in milliseconds on embedded processors used in current artificial pancreas devices, and future AP platforms are expected to include even more capable on-device AI hardware. Compression and optimization techniques (e.g., pruning, quantization) further support real-time implementation. Thus, the proposed framework is compatible with the computational requirements of modern and next-generation AP systems. Although the number of patients is limited, the dataset for each patient is complete, and we found that TD3+BC offline RL is not only more effective but also easier to tune than BCQ. Our framework shows translational potential in precision medicine by enriching digital models with extensive medical data collected from wearable devices (Raza et al., 2022; Venkatesh et al., 2022). For example, with a few slight modifications to the ODE model to address the insulin resistance of tissue, we could model and optimize insulin usage in patients with type 2 diabetes who develop insulin resistance (Cotero et al., 2022). Efforts in the hardware design of wearable sensors are also beneficial to our model development.
For example, novel wearable devices are being designed to probe, at a higher resolution, insulin and other metabolites that are impossible to quantify with existing techniques (Poudineh et al., 2021; Wang et al., 2022). Furthermore, the rapid development of non-invasive wearable devices tremendously increases patients’ compliance in using wearable devices (Mitratza et al., 2022), which helps to reduce data sparsity through more frequent sensor readouts and consequently improves the performance of our framework. These next-generation sensors will provide high-resolution and unprecedented data, which will significantly enhance the performance of our framework. Future artificial pancreas systems are expected to include more capable embedded processors and lightweight AI accelerators, which will support increasingly complex control algorithms. Because our framework operates entirely in inference mode and has a compact architecture, it is well suited for deployment on these next-generation devices.

Supplementary Material

Bibliography

1. Anderson RJ, Freedland KE, Clouse RE, & Lustman PJ (2001). The prevalence of comorbid depression in adults with diabetes: a meta-analysis. Diabetes Care, 24(6), 1069–1078. doi:10.2337/diacare.24.6.1069. PMID: 11375373.
2. Beck RW, Bergenstal RM, Laffel LM, & Pickup JC (2019). Advances in technology for management of type 1 diabetes. The Lancet, 394(10205), 1265–1273.
3. Bent B, Cho PJ, Henriquez M, Wittmann A, Thacker C, Feinglos M, Crowley MJ, & Dunn JP (2021). Engineering digital biomarkers of interstitial glucose from noninvasive smartwatches. npj Digital Medicine, 4(1), 89. doi:10.1038/s41746-021-00465-w. PMID: 34079049. PMCID: PMC8172541.
4. Bergman RN, Ider YZ, Bowden CR, & Cobelli C (1979). Quantitative estimation of insulin sensitivity. American Journal of Physiology-Endocrinology and Metabolism, 236(6), E667.
5. Braithwaite SS, Barakat K, Idrees T, Qureshi F, & Soetan OT (2020). Algorithm maxima for intravenous insulin infusion. Diabetes Technology & Therapeutics, 22(11), 861–864. doi:10.1089/dia.2020.0343. PMID: 32915059. PMCID: PMC7698999.
6. Bullard KM, Cowie CC, Lessem SE, Saydah SH, Menke A, Geiss LS, Orchard TJ, Rolka DB, & Imperatore G (2018). Prevalence of diagnosed diabetes in adults by diabetes type—United States, 2016. Morbidity and Mortality Weekly Report, 67(12), 359. doi:10.15585/mmwr.mm6712a2. PMID: 29596402. PMCID: PMC5877361.
7. Cai X, Chen J, Zhu Y, Wang B, & Yao Y (2023). Towards safe propofol dosing during general anesthesia using deep offline reinforcement learning. arXiv preprint arXiv:2303.10180.
8. Cappon G, Vettoretti M, Sparacino G, & Facchinetti A (2019). Continuous glucose monitoring sensors for diabetes management: a review of technologies and applications. Diabetes & Metabolism Journal, 43(4), 383–397. doi:10.4093/dmj.2019.0121. PMID: 31441246. PMCID: PMC6712232.