Forecasting intracranial hypertension using multi-scale waveform metrics

Matthias Hüser; Adrian Kündig; Walter Karlen; Valeria De Luca; Martin Jaggi

arXiv:1902.09499·eess.SP·December 3, 2020


TL;DR

This study presents a predictive framework for early detection of intracranial hypertension using multi-scale waveform metrics, achieving high recall rates hours before critical events in traumatic brain injury patients.

Contribution

The paper introduces a novel multi-scale waveform analysis method that improves early prediction of intracranial hypertension over existing approaches.

Findings

01

Predicted hypertensive events up to 8 hours in advance with 90% recall at a precision of 30.3%.

02

High-frequency waveform features significantly enhance prediction accuracy.

03

Long-term history up to 8 hours is crucial for effective forecasting.



Tables

Table 1. TABLE I: Patient demographic information for ICU stays ( $n_{s} = 50$ ) matched with $n = 66$ recording segments; $n = 57$ recording segments could not be matched in the MIMIC-III CDB

Age median [IQR]	62.5 [57.0-72.5]
Sex (% male)	38.0
Hospital mortality rate (%)	22.0
ICU LOS days median [IQR]	13.9 [7.4-21.7]
Admission type	Emergency ( $n_{s} = 48$ )
	Elective ( $n_{s} = 2$ )
ICU care service	Neurological surgical ( $n_{s} = 42$ )
	Neurological medical ( $n_{s} = 3$ )
	Unspecified medical ( $n_{s} = 3$ )
	Cardiac medical ( $n_{s} = 1$ )
	Obstetric ( $n_{s} = 1$ )
Diagnosis	Subarachnoid hemorrhage ( $n_{s} = 21$ )
	Intracranial hemorrhage ( $n_{s} = 19$ )
	Interparenchymal hemorrhage ( $n_{s} = 2$ )
	Brain tumor ( $n_{s} = 2$ )
	Headache ( $n_{s} = 1$ )
	Bleed ( $n_{s} = 1$ )
	Other (hematology) ( $n_{s} = 1$ )
	Other (respiratory) ( $n_{s} = 1$ )
	Other (hepatology) ( $n_{s} = 1$ )
	Other (obstetric) ( $n_{s} = 1$ )
GCS median [IQR]	3 [3-4]

Table 2. TABLE II: Overview of basic block functions computed on 1-minute windows

Statistical/complexity summaries (ICP, CPP, ABPm/d/s, HR)

Median, Interquartile range, Line length [62], Shannon entropy

Spectral band energy metrics (wICP, wABP, wPLETH, wRESP, ECG)

Energy in frequency bands [0,1],[1,2],[2,3],[3,6],[6,9],[9,12],[12,15] Hz

Autoregulation indices on time series (1 Hz sample rate)

AmpIndex(ICP,ABPm), AmpIndex(ICP,CPP), AmpIndex(CPP,ABPm) [31]

PaxIndex(ICP,CPP,ABPm) [31]

PrxIndex(ICP,CPP,ABPm) [63, 64]

RapIndex(ICP,CPP) [65, 66]

SlowWaveIndex(ICP) [67]

TFIndex(ICP,ABPm), TFIndex(ICP,CPP), TFIndex(CPP,ABPm) [68]

Autoregulation indices on waveforms (125 Hz sample rate)

AmpIndex(wICP,wABP) [31]

SlowWaveIndex(wICP) [67]

TFIndex(wICP,wABP) [68]

IaacIndex(wICP,wABP) [37]

Morphological pulse metrics on waveforms

wABP pulse descriptor (17 metrics) [61]:

A, UpstrokeTime, TimeAtΠ, TimeAtDw, DownstrokeTime, SysDiasTimeDifference, HeightSysPeak, HeightInflPoint, HeightDicroticWave, R1, R2, R3, R4, R5, R6, Aix

wICP pulse descriptor (20 metrics) [20]:

Mean, Dias, DP1, DP2, DP3, DP12, DP13, DP23, L1, L2, L3, L12, L13, L23, Curv1, Curv2, Curv3, Slope, DecayTimeConst, AverageLatency

Table 3. TABLE III: Prediction performance of models by inclusion of physiological time series/waveform channels in the feature generation process

Channels	Prec@75Rec	Prec@90Rec	AUPRC
ICP	0.311 $\pm$ 0.004	N/A	0.462 $\pm$ 0.007
ABP	0.226 $\pm$ 0.001	0.226 $\pm$ 0.000	0.238 $\pm$ 0.001
CPP	0.230 $\pm$ 0.001	0.223 $\pm$ 0.001	0.243 $\pm$ 0.003
ICP/ABP/CPP (1 Hz)	0.332 $\pm$ 0.003	0.267 $\pm$ 0.001	0.443 $\pm$ 0.005
+wICP	0.371 $\pm$ 0.001	0.303 $\pm$ 0.001	0.512 $\pm$ 0.003
+wICP/ABP	0.377 $\pm$ 0.001	0.303 $\pm$ 0.001	0.517 $\pm$ 0.003
+wALL	0.379 $\pm$ 0.002	0.302 $\pm$ 0.001	0.510 $\pm$ 0.003
only wICP	0.358 $\pm$ 0.001	0.299 $\pm$ 0.001	0.516 $\pm$ 0.003
only wICP/wABP	0.366 $\pm$ 0.001	0.298 $\pm$ 0.002	0.516 $\pm$ 0.003

Table 4. TABLE IV: Prediction performance by models based on different basic block functions (first part), and multi-scale history summary functions (second part)

Feature types	Prec@75Rec	Prec@90Rec	AUPRC
Stat/Complexity	0.328 $\pm$ 0.003	0.270 $\pm$ 0.002	0.464 $\pm$ 0.005
+BandEnergy	0.356 $\pm$ 0.002	0.286 $\pm$ 0.001	0.502 $\pm$ 0.003
+AutoRegIndices	0.368 $\pm$ 0.001	0.289 $\pm$ 0.001	0.503 $\pm$ 0.003
+PulseMorphology	0.377 $\pm$ 0.001	0.303 $\pm$ 0.001	0.517 $\pm$ 0.003
Location	0.373 $\pm$ 0.002	0.300 $\pm$ 0.002	0.508 $\pm$ 0.004
Loc+Trend+Variation	0.377 $\pm$ 0.001	0.303 $\pm$ 0.001	0.517 $\pm$ 0.003

Table 5. TABLE V: Comparison of different machine learning methods applied to the optimal features found by SHAP analysis.

ML method	Prec@75Rec	Prec@90Rec	AUPRC
LogReg	0.377 $\pm$ 0.001	0.303 $\pm$ 0.001	0.517 $\pm$ 0.003
MLP	0.373 $\pm$ 0.002	0.300 $\pm$ 0.002	0.514 $\pm$ 0.003
Tree	0.311 $\pm$ 0.001	N/A	0.375 $\pm$ 0.003
GradBoost	0.353 $\pm$ 0.002	0.276 $\pm$ 0.001	0.465 $\pm$ 0.003

Table 6. TABLE VI: Comparison of our proposed model with different baselines from the literature.

Models	Prec@75Rec	Prec@90Rec	AUPRC
Optimal	0.377 $\pm$ 0.001	0.303 $\pm$ 0.001	0.517 $\pm$ 0.003
BL1: Hu et al. [20]	0.331 $\pm$ 0.001	0.274 $\pm$ 0.001	0.473 $\pm$ 0.003
BL2: Myers et al. [21]	0.338 $\pm$ 0.002	0.259 $\pm$ 0.001	0.484 $\pm$ 0.003
Models	Spec@75Sens	Spec@90Sens	AUROC
Optimal	0.653 $\pm$ 0.002	0.417 $\pm$ 0.003	0.771 $\pm$ 0.001
BL1: Hu et al. [20]	0.577 $\pm$ 0.002	0.336 $\pm$ 0.003	0.738 $\pm$ 0.001
BL2: Myers et al. [21]	0.602 $\pm$ 0.003	0.304 $\pm$ 0.003	0.746 $\pm$ 0.001

Table 7. TABLE VII: Overview of 20 most important features for predicting ICH, identified by the SHAP analysis in the 10 splits

Rank	Feature descriptor	Important scales
1	Med(IcpPulse_Dias(wICP))	480,360,30,15,60,180,240
2	Med(SpectralEnergy(wICP)_0-1Hz)	360,480,30,240
3	Med(IcpPulse_Mean(wICP))	480,360,120,180
6	Time since segment start	N/A
9	Med(SpectralEnergy(ECG)_0-1Hz)	480,180,60
15	Med(AmpIndex(CPP,ABPm))	480
17	Med(SpectralEnergy(wPLETH)_2-3Hz)	480
18	Med(SpectralEnergy(wICP)_9-12Hz)	480
19	Med(SpectralEnergy(wRESP)_0-1Hz)	480
22	Med(SpectralEnergy(wPLETH)_0-1Hz)	480
24	Med(SpectralEnergy(wRESP)_1-2Hz)	480
26	Med(IcpPulse_Slope(wICP))	480
27	Med(ShannonEntropy(HR))	480
28	Iqr(SpectralEnergy(ECG)_0-1Hz)	360
29	Current ICP value	N/A
30	Med(SpectralEnergy(wRESP)_2-3Hz)	480
31	Iqr(SlowWaveIndex(ICP))	360
32	Med(Med(ICP))	480
33	Med(SpectralEnergy(wICP)_6-9Hz)	480
34	Iqr(SpectralEnergy(wPLETH)_1-2Hz)	480

Table 8. TABLE VIII: Overall importance of feature categories evaluated by the number of inclusions in the top 100 features, across 10 splits

Feature descriptor	Inclusion count	Normalized inclusion count
Physiological channel
wICP	369	0.050
ECG	344	0.051
wABP	128	0.020
ICP	83	0.029
wRESP	79	0.047
wPLETH	71	0.042
CPP	60	0.023
ABPm	53	0.022
HR	28	0.029
ABPs	25	0.026
ABPd	7	0.007
Base feature function
SpectralEnergy	463	0.055
IcpPulseMorph	200	0.042
Median	65	0.034
Entropy	56	0.029
AbpPulseMorph	53	0.013
LineLength	34	0.018
TFIndex	31	0.032
SlowWaveIndex	26	0.054
AmpIndex	21	0.022
PrxIndex	14	0.058
Iqr	13	0.007
IaacIndex	4	0.017
PaxIndex	4	0.017
RapIndex	3	0.013
Summary function
Median	715	0.076
Iqr	262	0.028
Slope	10	0.001
History length [mins]
480 Mins	465	0.131
360 Mins	242	0.068
240 Mins	104	0.029
180 Mins	71	0.020
120 Mins	34	0.010
60 Mins	28	0.008
15 Mins	24	0.007
30 Mins	19	0.005

Table 9. TABLE IX: Most important physiological metrics extracted from high-frequency ICP/ABP waveforms, among the top 100 features overall

Rank	Feature descriptor	Most important scales
1	Med(IcpPulse_Dias(wICP))	480,360,30,15,60,180,240,120
2	Med(SpectralEnergy(wICP)_0-1Hz)	360,480,30,240,120,180,60,15
3	Med(IcpPulse_Mean(wICP))	480,360,60,120,15,240
18	Med(SpectralEnergy(wICP)_9-12Hz)	480,360
26	Med(IcpPulse_Slope(wICP))	480,180,360
33	Med(SpectralEnergy(wICP)_6-9Hz)	480
36	Med(IcpPulse_DecayTimeConst(wICP))	480
50	Iqr(SpectralEnergy(wICP)_9-12Hz)	480
52	Med(IcpPulse_L2(wICP))	480
61	Med(SpectralEnergy(wICP)_12-15Hz)	480
68	Med(IcpPulse_Curve1(wICP))	480
76	Med(IcpPulse_L3(wICP))	480
83	Med(SpectralEnergy(wICP)_9-12Hz)	360
85	Iqr(IcpPulse_DP13(wICP))	480
93	Iqr(SpectralEnergy(wICP)_12-15Hz)	480
99	Med(IcpPulse_DP3(wICP))	480
66	Med(AbpPulse_AverageLatency(wABP))	480
72	Med(SpectralEnergy(wABP)_12-15Hz)	480

Table 10. TABLE X: Most important cerebral auto-regulation indices extracted from time series and waveforms, among the top 100 features overall

Rank	Feature descriptor	Most important scales
15	Med(AmpIndex(CPP,ABPm))	480
31	Iqr(SlowWaveIndex(ICP))	360,480
39	Med(PrxIndex(ICP,CPP,ABPm))	480
44	Iqr(TFIndex(wICP,wABP))	480,360
64	Med(TFIndex(CPP,ABPm))	480,360
77	Iqr(SlowWaveIndex(wICP))	480


Full text


Matthias Hüser, Adrian Kündig, Walter Karlen, Valeria De Luca, and Martin Jaggi

M. Hüser is with the Biomedical Informatics Group, Department of Computer Science, ETH Zürich, Universitätstrasse 6, 8092 Zürich, Switzerland (correspondence e-mail: [email protected]). A. Kündig was with the Department of Computer Science, ETH Zürich, 8092 Zürich, Switzerland. W. Karlen is with the Mobile Health Systems Lab, Institute of Robotics and Intelligent Systems, Department of Health Sciences and Technology, ETH Zürich, 8008 Zürich, Switzerland. V. De Luca was with the Department of Information Technology and Electrical Engineering, ETH Zürich, 8092 Zürich, Switzerland. She is now with the Novartis Institutes for Biomedical Research, 4056 Basel, Switzerland. M. Jaggi is with the Machine Learning & Optimization Lab, EPFL, 1015 Lausanne, Switzerland.

Abstract

Objective: Acute intracranial hypertension is an important risk factor of secondary brain damage after traumatic brain injury. Hypertensive episodes are often diagnosed reactively, leading to late detection and lost time for intervention planning. A pro-active approach that predicts critical events several hours ahead of time could assist in directing attention to patients at risk. Approach: We developed a prediction framework that forecasts onsets of acute intracranial hypertension in the next 8 hours. It jointly uses cerebral auto-regulation indices, spectral energies and morphological pulse metrics to describe the neurological state of the patient. One-minute base windows were compressed by computing signal metrics, and then stored in a multi-scale history, from which physiological features were derived. Main results: Our model predicted events up to 8 hours in advance with alarm recall rates of 90% at a precision of 30.3% in the MIMIC-III waveform database, improving upon two baselines from the literature. We found that features derived from high-frequency waveforms substantially improved the prediction performance over simple statistical summaries of low-frequency time series, and each of the three feature classes contributed to the performance gain. The inclusion of long-term history up to 8 hours was especially important. Significance: Our results highlight the importance of information contained in high-frequency waveforms in the neurological intensive care unit. They could motivate future studies on pre-hypertensive patterns and the design of new alarm algorithms for critical events in the injured brain.

Index Terms:

Cerebral auto-regulation indices, Intracranial hypertension, Intracranial pressure, Machine learning, ICP pulse morphology

I Introduction

With at least 10 million cases annually leading to hospitalization worldwide, traumatic brain injury (TBI), often causing intracranial hemorrhage, is a major public health issue [1]. After initial admission to the intensive care unit (ICU) and assessment of the primary brain injury, further neurological damage often occurs. This phenomenon is referred to as secondary brain injury, and often leads to long-term brain damage through e.g. cerebral ischemia (decrease of blood flow to the brain) [2], cerebral hypoxia (decrease of substrate/oxygen flow to the brain) [3] and brain herniation (swelling leading to compression of brain structures [4]).

Management of TBI patients in the neurological ICU focuses on mitigating and possibly reversing secondary injuries [5]. A key variable in the management of secondary brain injury is intracranial pressure (ICP) [6, 7]. Cerebral compliance maintains blood- and energy substrate flow by holding pressure constant against slight volume changes of the cranial components [8]. The ICP value of a healthy adult is maintained by this mechanism in the range 7-15 mmHg [9]. However, if compliance is reduced, rapid non-linear ICP elevations can occur [10]. A sustained ICP elevation over 20 mmHg is defined as acute intracranial hypertension (ICH) [11]. An illustrative example of an ICH event is shown in Fig. 1.

A direct association of time spent in the ICH state with clinical outcome has been empirically shown: the area under the ICP curve in the first 48 hours of ICU treatment is an independent predictor of in-hospital mortality [12]. Various other studies have established an association between ICH and poor neurological outcome [13, 14, 15]. Accordingly, it is a common treatment goal in neuro-critical care to avoid acute intracranial hypertension [7]. Invasive, intra-parenchymal ICP monitoring combined with interventions such as external ventricular drainage or surgery is the gold standard to control and maintain ICP in the physiological range of 7-15 mmHg and ensure adequate cerebral compliance [9]. Advances in monitoring and signal processing technology have made it possible to record high-frequency ICP traces and analyze them in real time [16]. Yet there are several caveats that hinder the interpretation of the ICP signal and its use as a decision-support tool: (a) raw data and time-varying trends are presented to the clinician, and no risk estimates for ICH are available. This can lead to information overload and over-consumption of human attention for the ICU personnel. For example, a study has found that clinicians are often not confident that effort spent on inspection of ICP traces is redeemed by improving outcome after TBI [17]; (b) threshold-based track-and-trigger systems usually have too high false alarm rates, which can desensitize staff to dangerous hypertension events [18]; (c) alarms are only triggered after onset of acute intracranial hypertension, when long-term effects might be harder to prevent.

To address these problems, robust forecasting of ICH onsets could augment the current treatment protocol which is reactive in nature. Previous works have shown that complex precursor patterns occur in auto-regulation indices and ICP/ABP waveform morphology prior to hypertensive events [19, 20]. Recently, simple prediction models explicitly targeting ICH forecasting 30 minutes up to 6 hours before the event were proposed, yielding promising results [21, 22, 23]. However, it is not well understood or investigated which of these approaches is necessary or sufficient to achieve high prediction performance in the context of an early warning system for ICH, and whether additional benefits could be derived from their combination.

In this extensive empirical study of ICH prediction, we make the following contributions:


An online ICH prediction framework which describes the neurological state using multi-scale metrics of the last 8 hours of recordings, comprising classical statistical features, cerebral auto-regulation indices, frequency band energies and ICP/ABP pulse morphology computed on high-frequency waveforms. The resulting model is shown to outperform two baseline models from the literature in a controlled comparison.

By including a wide range of relevant channels and physiological feature types, we conduct the first systematic study of ICH prediction across signal channels and feature types, and thereby benchmark various pre-hypertensive patterns exploited or hypothesized in previous works.

We demonstrate clear performance benefits when including morphological and spectral energy features derived from high-frequency waveforms, compared to using only statistical metrics on low-frequency time series, which have often been used in major recent works.

Using the state-of-the-art feature attribution technique SHAP (SHapley Additive exPlanation) we study the importance and generate rankings of different features that explain positive intracranial hypertension alarms, representing the first application of this technique to an extensive set of pre-hypertensive patterns.

Preliminary and partial versions of this work have been reported in clinical abstracts [24, 25].

II Related work

The association between information contained in high-frequency physiological waveforms/time series and elevated ICP has been studied in various works. For example, Hornero et al. [26] found that decreased ICP signal complexity and irregularity are associated with intracranial hypertension. Fan et al. [27] identified an association between ICP variability and decreased pressure auto-regulation. Recently, it was established that characteristic patterns in various physiological channels are correlated with ICP and could thus be used to predict ICH [28]. Several auto-regulation indices defined on physiological channels were reported, such as by Zeiler et al. [29], who studied the moving correlation coefficient between ICP, ABP and CPP channels, and others [30, 31]. The relationship between auto-regulation indices and successive ICH events has been studied by Kim et al. [32]. In general, it has long been suspected that the information contained in the pulsatile ICP signal is very rich beyond simple statistical summaries [33, 34].

Besides auto-regulation indices, previous works have attempted to use morphological descriptors of the intracranial pressure pulse to predict ICH onset up to 20 minutes in advance [35, 20, 36]. More generally, morphological analysis of ICP pulses [37] has emerged as a successful approach and was used to e.g. reduce false alarm rates of ICP alarms [38] and track pulse metrics in real-time [39]. Hu et al. [40] applied cluster analysis to individual ICP pulses. Other types of features that have been proposed to summarize physiological time series include bag-of-words of physiological motifs applied to ECG/EEG time series [41] and entropy measures [42, 43]. The recently proposed ICP trajectories framework [44] uses longitudinal ICP time series to discover clinical phenotypes. Different approaches have also been proposed, based on assessing risk only from static clinical data [45] or biomarkers [46, 47, 48], instead of using historical time series.

To obviate the need for explicit feature engineering on historical time series, deep learning architectures have been proposed, which detect intracranial hypertension from the raw pulse waveform [49]. Simpler dimensionality reduction approaches, such as principal component analysis, have also been used to find non-correlated features [50] that describe ICH.

Major recent works explicitly addressing the ICH forecasting problem include the approach proposed by Güiza et al. [22], which obtained an AUROC of 0.87 for prediction of ICH in the next 30 minutes. Their analysis showed that the most predictive channel is ICP and that the most recent measurements are the most relevant features. Subsequently, their model was externally validated, resulting in similar performance [23]. Myers et al. [21] proposed a model that is able to predict ICH up to 6 hours in advance, a prediction horizon comparable to our method. It uses simple features such as the last measured ICP value or the time to the last ICH crisis. Besides tackling the classification task directly, other models have been suggested that predict the future ICP mean value, for example by using nearest-neighbor regression [51], neural networks [52, 53] or ARIMA models [54].

III Methods

III-A Physiological database

In all experiments, we used the Medical Information Mart for Intensive Care III waveform database (MIMIC-III WFDB) [55], Version 1.0. The entire dataset consists of 67,830 records extracted from patient stays at the Beth Israel Deaconess Medical Center, Boston, MA, United States. The MIMIC-III WFDB was chosen for this study because it contains simultaneous measurements of high-frequency waveforms (125 Hz) and derived time series (1 Hz) for a range of physiological channels that are relevant to the prediction problem. Among all available channels, we selected the ICP (mean intracranial pressure), CPP (cerebral perfusion pressure), ABPm/d/s (mean/diastolic/systolic arterial blood pressures) and HR (heart rate) time series, and the wICP (intracranial pressure), wABP (invasive arterial blood pressure), wPLETH (raw output of fingertip plethysmograph), wRESP (respiration waveform) and ECG waveforms (Fig. 3a). This broad range allows us to compare the relevance of different channels for predictive modeling, while ensuring that we can extract a cohort of at least 100 ICU recordings with regular sampling. Waveforms were acquired using the bedside IntelliVue Patient Monitoring system, Philips Healthcare, The Netherlands.

III-B Cohort selection

Only a small fraction of available records in the MIMIC-III WFDB contain ICP data. In a first step, we discarded all segments that have no available ICP time series, which left 1586 relevant segments. We further require a minimum recording length of 24 hours, and a missing value ratio of at most 25% for each considered waveform or time series channel. We applied these criteria to ensure that the relevance of different channels as features could be meaningfully compared, and individual channels would not be negatively affected by long stretches of missing data. After applying these criteria, 123 segments remained in the cohort. This set of recording segments was used in all reported experiments. A diagram summarizing patient exclusions and cohort definition is shown in Fig. 2. Matching to MIMIC-III clinical database records was only possible for 66 of the 123 segments, contributed by 50 unique patients, for which clinical context is provided in Table I. In terms of admission diagnosis, the cohort is homogeneous, with most patients exhibiting intracranial hemorrhage. In this work, we did not include clinical covariates into the processing pipeline to avoid unequal treatment of segments or a significant reduction of available segments. Overall, our data-set contains 10,547 hours of data (Fig. 3a). Each segment has a mean recording length of 85.8 hours (std: 54.8 hours). The mean ICP value in the cohort is 9.1 mmHg (std: 6.8 mmHg).

III-C Acute intracranial hypertension alarms

We define an ICH event as 5 successive 1-minute blocks with median ICP greater than 20 mmHg, which helps to avoid spurious labels, following the recent work by Ziai et al. [56], the earlier work by Hu et al. [20], and the definition of sustained ICH from [57]. According to this definition, patients were in an acute ICH event state for 2.5% of their cumulative segment lengths. When analyzed by segment, the ICH state was active for a mean of 2.0% (std: 7.3%) of each segment's duration. A time point on the 1-minute grid was labeled as positive if there is any acute intracranial hypertension event in the next 8 hours and the patient is not already hypertensive (Fig. 3b). This strategy implements an early warning system deployed in phases where the patient has normal ICP values. Our design choice was to train one overall model for predicting events in the next 8 hours, without targeting any specific prediction horizon. Positive labels correspond to time points at which an alarm should be produced by our prediction model (Fig. 3b). Recall is defined as the fraction of those points at which an alarm is indeed produced. Precision denotes the fraction of produced alarms that fall in the 8 hours prior to some ICH event (Fig. 3g). Both metrics are maximized if continuous sequences of alarms are produced exactly in the 8 hours before events, one for each grid point. However, in clinical implementation this strict condition could be relaxed by applying post-processing such as moving window functions over the sequence of thresholded prediction scores. We consider such processing to be out of scope here, but we suspect it can improve practical alarm system performance significantly, both in terms of recall and false alarm rate. There were a total of 555,644 labeled 1-minute time points, of which 117,954 were positive and 437,690 negative.
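The labeling rule above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, grid points inside an event are marked `-1` (the text only says that such points are not labeled positive), and "any event in the next 8 hours" is interpreted as any event block within the following 480 grid points.

```python
import numpy as np


def ich_event_mask(median_icp, threshold=20.0, min_blocks=5):
    """Mark 1-minute blocks belonging to a run of at least `min_blocks`
    successive blocks whose median ICP exceeds `threshold` (mmHg)."""
    above = np.asarray(median_icp, dtype=float) > threshold  # NaN compares as False
    in_event = np.zeros(above.size, dtype=bool)
    run_start = None
    # A trailing False acts as a sentinel that closes a run at the array end.
    for t, flag in enumerate(np.append(above, False)):
        if flag and run_start is None:
            run_start = t
        elif not flag and run_start is not None:
            if t - run_start >= min_blocks:
                in_event[run_start:t] = True
            run_start = None
    return in_event


def early_warning_labels(in_event, horizon=480):
    """Label each grid point: 1 if an event block lies in the next `horizon`
    minutes, 0 otherwise, and -1 (assumption) if already inside an event."""
    in_event = np.asarray(in_event, dtype=bool)
    n = in_event.size
    # csum[k] is the number of event blocks among the first k grid points.
    csum = np.concatenate(([0], np.cumsum(in_event)))
    labels = np.zeros(n, dtype=int)
    for t in range(n):
        end = min(n, t + horizon + 1)
        labels[t] = 1 if csum[end] - csum[t + 1] > 0 else 0
    labels[in_event] = -1
    return labels
```

With the counts reported above, the positive label prevalence is 117,954 / 555,644, i.e. roughly 21%, which is the precision a trivial always-alarm policy would reach.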

III-D Physiological feature extraction framework

Basic block functions

During the feature generation process, so-called basic block functions are computed online on non-overlapping windows containing 1 minute of high-frequency waveforms/time series, corresponding to 60 samples at 1 Hz or 7500 samples at 125 Hz (Fig. 3d). The choice of 1 minute as a basic interval makes computation of complex morphological functions tractable, increases robustness to signal artifacts and sensor detachments, and allows updated predictions to be produced every minute. Before computing basic block functions, a window is pre-processed by removing physiologically implausible values. If at least half of the samples are valid, we reconstruct the remaining samples by linear interpolation. Otherwise, invalid basic blocks marked by a symbolic value are emitted (Fig. 3c). Basic block functions are then computed on valid blocks. As basic block functions we considered statistical/complexity summaries (median, interquartile range (IQR), line length, Shannon entropy), spectral band energies of waveforms, morphological pulse summaries of the wABP and the wICP waveforms, as well as cerebral auto-regulation indices. Morphological pulse metrics are computed by an algorithm consisting of several steps (Fig. 3d). First, individual pulses on wABP/wICP are segmented, using variants of known algorithms [58, 59, 60], with the help of the ECG waveform as a reference to identify pulse onsets. Valid pulses in the window are then temporally scaled to make their lengths comparable, overlaid and averaged point-wise, yielding an averaged pulse. Morphological pulse metrics, modeled on those described by Hu et al. [20] and Almeida et al. [61], are then computed on the averaged pulse. A complete overview of the basic block functions is provided in Table II.
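As an illustration of the block-level processing described above, the following sketch validates a 1-minute window and computes the statistical/complexity summaries and the band energies of Table II. Several details are assumptions made for the example and are not specified in the text: the plausibility bounds `lo`/`hi`, the histogram-based entropy estimate with 16 bins, the use of a plain FFT periodogram for band energies, and all function names. Pulse morphology and auto-regulation indices are omitted.

```python
import numpy as np

# Frequency bands in Hz, as listed in Table II.
BANDS_HZ = ((0, 1), (1, 2), (2, 3), (3, 6), (6, 9), (9, 12), (12, 15))


def preprocess_block(x, lo, hi):
    """Drop implausible samples; interpolate linearly if at least half of the
    samples are valid, otherwise return None to signal an invalid block."""
    x = np.asarray(x, dtype=float)
    valid = np.isfinite(x) & (x >= lo) & (x <= hi)
    if x.size == 0 or valid.sum() < 0.5 * x.size:
        return None
    idx = np.arange(x.size)
    # np.interp holds the nearest valid value constant beyond the block edges.
    return np.interp(idx, idx[valid], x[valid])


def line_length(block):
    """Sum of absolute differences between successive samples."""
    return float(np.sum(np.abs(np.diff(block))))


def shannon_entropy(block, bins=16):
    """Entropy (bits) of a histogram estimate of the sample distribution."""
    counts, _ = np.histogram(block, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def band_energies(block, fs):
    """Periodogram energy per band; the mean is removed before the FFT."""
    power = np.abs(np.fft.rfft(block - block.mean())) ** 2
    freqs = np.fft.rfftfreq(block.size, d=1.0 / fs)
    return [float(power[(freqs >= lo) & (freqs < hi)].sum()) for lo, hi in BANDS_HZ]


def basic_block_features(x, fs, lo, hi):
    """Return a feature dict for one window, or None for an invalid block."""
    block = preprocess_block(x, lo, hi)
    if block is None:
        return None
    q25, q50, q75 = np.percentile(block, [25, 50, 75])
    return {
        "median": float(q50),
        "iqr": float(q75 - q25),
        "line_length": line_length(block),
        "entropy": shannon_entropy(block),
        "band_energy": band_energies(block, fs),
    }
```

Returning `None` for an invalid block plays the role of the symbolic value mentioned in the text, so that a downstream history buffer can decide how to fill the gap.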

Multi-scale history

Computed basic block features are appended to a history buffer using an online algorithm, with one batch of features appended per minute (Fig. 3e). If a block is invalid or some features cannot be computed, for example due to missing signals, they are forward filled from the last valid feature in the history. If there is no valid feature in the recent past, the feature value is set to the median of that feature in the accumulated history. After the history buffer is updated, a new sample of machine learning features is emitted by summarizing the current state of the history buffer (Fig. 3e). As summary functions, we use the median (location estimate of a basic feature over the history), the IQR (variability of a basic feature over the history) and the slope of a fitted regression line (trend of a basic feature over the history). These summary functions are applied separately over the last 15, 30, 60, 120, 180, 240, 360 and 480 minutes to capture pre-hypertensive patterns at various scales of the feature buffer history. After the full feature matrix is built, we standardize all feature columns to have zero mean and unit standard deviation, using statistics from the training data-set. Missing values are then replaced by zero, which corresponds to global mean imputation. For machine learning models that can deal with missing data natively, like decision trees or tree ensembles, missing data imputation/normalization was not performed. The online signal processing and feature generation algorithms were implemented using the numerical packages SciPy and NumPy in Python 3.6.
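A minimal sketch of the multi-scale summary step follows, assuming the history of one basic block feature is available as a gap-free array with one value per minute (forward filling is taken as already done). The scale list is taken from the history lengths that appear in Table VIII; the function name and the use of `np.polyfit` for the trend are illustrative choices, not the authors' implementation.

```python
import numpy as np

# History lengths in minutes, as they appear in Table VIII.
SCALES_MIN = (15, 30, 60, 120, 180, 240, 360, 480)


def summarize_history(history, scales=SCALES_MIN):
    """Summarize one basic block feature over several look-back windows.

    `history` holds one value per minute, most recent last, and must be
    non-empty. Windows longer than the available history use all of it.
    """
    history = np.asarray(history, dtype=float)
    feats = {}
    for k in scales:
        window = history[-k:]
        q25, q50, q75 = np.percentile(window, [25, 50, 75])
        if window.size >= 2:
            # Slope of a least-squares line, in feature units per minute.
            slope = float(np.polyfit(np.arange(window.size), window, 1)[0])
        else:
            slope = 0.0
        feats[f"median_{k}"] = float(q50)   # location
        feats[f"iqr_{k}"] = float(q75 - q25)  # variation
        feats[f"slope_{k}"] = slope         # trend
    return feats
```

Each basic block feature therefore expands into three summaries per scale, which is why the final feature matrix is wide and why a feature selection step is useful afterwards.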

III-E Feature interpretation using SHAP values

To gain insight into the precursor patterns of intracranial hypertension, we used SHAP value analysis [69] to uncover the most important features that explain ICH predictions. SHAP (SHapley Additive exPlanations) is a local feature attribution method, which attributes risk scores of future intracranial hypertension to individual signal patterns encoded in the physiological features. The SHAP value of a feature $x_{i}=k$ is the expected change of the risk score when this feature is added to update the risk estimate, integrating over all possible subsets of other variables that are already used in the risk estimate prior to adding the new variable. The SHAP value of a prediction at $\mathbf{x}$ for feature $i$ is defined as $s_{i,\mathbf{x}}:=E_{S}[E[f(\mathbf{x})|\mathbf{x}_{S\cup\{i\}}]-E[f(\mathbf{x})|\mathbf{x}_{S}]]$. Here $f$ denotes the risk score, $E[f(\mathbf{x})|\mathbf{x}_{S}]$ the conditional expectation of the risk score if the values of the features in $S$ are fixed to their observed values, and $E_{S}[\cdot]$ the expectation over the choice of fixed features $S$. The TreeSHAP algorithm [70] that we used is an implementation of SHAP values for tree ensembles, which can deal with missing values and hence simplifies the computation of $s_{i,\mathbf{x}}$. We summarized SHAP values of predictions on the validation set by defining the global importance of a feature as $g_{i}:=n^{-1}_{\text{val}}\sum_{\mathbf{x}\in\text{val}}|s_{i,\mathbf{x}}|$, i.e. the mean magnitude of risk score change that a particular feature causes when introduced into the model (Fig. 3f). Here, $n_{\text{val}}$ is the number of samples in the validation set. All features were ranked by $\{g_{i}\}$ for each split, defining the top features of the split. Ranks were averaged across splits to increase robustness of the reported feature rankings (Fig. 3g).
SHAP values were also used as a feature selection method internal to each split, by discarding all but the top 100 features on the validation set before creating derived models (Fig. 3g). In this way, overfitting to non-informative features, which are numerous due to the broad range of feature combinations, is reduced. The test set was not used for feature selection or feature ranking purposes.
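The aggregation of SHAP values into a global importance $g_{i}$, a per-split ranking, and a top-100 selection can be sketched as follows. The example assumes that a matrix of SHAP values with one row per validation sample and one column per feature has already been computed, for instance by a TreeSHAP implementation; how that matrix is obtained is outside the sketch, and the function names are hypothetical.

```python
import numpy as np


def global_importance(shap_values):
    """g_i: mean absolute SHAP value of feature i over validation samples.

    `shap_values` has shape (n_val, n_features).
    """
    return np.mean(np.abs(np.asarray(shap_values, dtype=float)), axis=0)


def ranks_from_importance(g):
    """Rank 1 is the most important feature; ties keep the column order."""
    order = np.argsort(-g, kind="stable")
    ranks = np.empty(g.size, dtype=int)
    ranks[order] = np.arange(1, g.size + 1)
    return ranks


def top_k_features(g, k=100):
    """Column indices of the k features with the largest global importance."""
    return np.argsort(-g, kind="stable")[:k]


def mean_rank_across_splits(shap_per_split):
    """Average the per-split ranks of every feature across all splits."""
    ranks = [ranks_from_importance(global_importance(s)) for s in shap_per_split]
    return np.mean(ranks, axis=0)
```

Selecting features per split on the validation set, as in `top_k_features`, keeps the test set untouched, which matches the separation of roles described above.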

III-F Machine learning models

As machine learning models we considered LogReg, an L2-regularized logistic regression model optimized using stochastic gradient descent [71]; Tree, a single decision tree; GradBoost, a gradient-boosted ensemble of decision trees [72]; and MLP, a multi-layer perceptron with a sigmoid activation function. Implementation details and hyper-parameter search grids for all machine learning models are listed in the supplementary material.

III-G Experimental design

Prediction models were evaluated using precision at 75% and 90% recall, which reflects our prior belief that an alarm system for ICH should have high sensitivity, whereas false alarms are more tolerable and can be reduced with post-processing defined on top of the sequence of prediction scores (Fig. 3g). All experimental result tables report these two metrics as well as the area under the PR curve. The 95% standard-error-based confidence intervals of performance metrics, which are used in all figures and tables, were constructed by drawing 10 randomized train/validation/test splits (proportions 40:20:40%) at the level of complete recording segments. Splits were stratified such that the positive label prevalence of the training, validation and test sets in each split is within 0.015 of the overall prevalence in the cohort. This minimizes nuisance effects for performance metrics that are sensitive to label prevalence (e.g. precision). The experiments performed per split are completely independent. The training set was used for model fitting, while the validation set was used for choosing the optimal set of hyperparameters and for computing the mean absolute SHAP values that define the reported feature rankings (Fig. 3f). Each split is associated with a distinct feature ranking, which we integrate over in the feature importance analysis (Fig. 3h). The test set was used to compute all reported performance metrics; hyperparameters and optimal features were not selected on this set, to avoid overfitting. To account for test set variability in addition to training process variability, we drew 100 bootstrap samples (each 50% of the test-set size) with replacement from the test set of each split, yielding 1000 replicates overall. Models whose performance is best, or indistinguishable from the best based on overlapping 95% confidence intervals, are listed in bold-face.
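The evaluation metric and the bootstrap procedure can be illustrated with a short sketch. It is not the authors' implementation: the function names, the handling of tied scores and the normal-approximation factor of 1.96 for the standard-error-based interval are assumptions made for illustration.

```python
import numpy as np

def precision_at_recall(y_true, scores, target_recall):
    """Best precision over all score thresholds whose recall is >= target_recall."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    n_pos = y_true.sum()
    if n_pos == 0:
        return float("nan")  # recall is undefined without positive labels
    order = np.argsort(-scores, kind="stable")
    sorted_scores = scores[order]
    tp = np.cumsum(y_true[order])
    recall = tp / n_pos
    precision = tp / np.arange(1, len(tp) + 1)
    # A threshold can only be placed between distinct score values.
    is_cut = np.append(sorted_scores[1:] != sorted_scores[:-1], True)
    ok = is_cut & (recall >= target_recall)
    return float(precision[ok].max())

def bootstrap_precision_at_recall(y_true, scores, target_recall,
                                  n_boot=100, frac=0.5, seed=0):
    """Metric replicates from bootstrap samples drawn with replacement."""
    rng = np.random.default_rng(seed)
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    m = max(1, int(round(frac * len(y_true))))
    reps = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=m)
        reps.append(precision_at_recall(y_true[idx], scores[idx], target_recall))
    return np.array(reps)

def se_confidence_interval(values, z=1.96):
    """Mean +/- z standard errors (z = 1.96 is an assumed normal approximation)."""
    v = np.asarray(values, dtype=float)
    v = v[~np.isnan(v)]
    se = v.std(ddof=1) / np.sqrt(len(v))
    return v.mean() - z * se, v.mean() + z * se

# Toy labels and scores (made up) for one test set.
y = np.array([1, 0, 1, 0] * 5)
s = np.tile([0.9, 0.8, 0.7, 0.1], 5)
reps = bootstrap_precision_at_recall(y, s, target_recall=0.9)
low, high = se_confidence_interval(reps)
```

In the setting described above, the replicates of all 10 splits would be pooled before computing the interval.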

IV Results

Low-frequency time series channels

As a sanity check, we trained several models that do not use any features derived from high-frequency waveforms. The results, shown in the first part of Table III, indicate that ICP is the single most valuable time series across all desired recall levels. The addition of ABP/CPP context information leads to consistent performance increases for most evaluation metrics.

Importance of high-frequency waveform metrics

Taking the most performant time series model (from the first part of Table III), we tested whether adding features derived from 125 Hz waveforms has a positive effect on the prediction performance (second part of Table III). Our results indicate that adding wICP yields a marked performance increase, and the joint use with wABP strengthens this effect slightly. Using only waveform channels shows consistently higher performance than just using time series.

Morphological, spectral energy metrics and cerebral auto-regulation indices

Morphological pulse metrics, cerebral auto-regulation indices and band energy have each been shown to exhibit characteristic changes before hypertensive events in prior work. We tested whether such changes can translate into performance benefits when the corresponding features are added to a simple model. Our results are summarized in the first part of Table IV. Incremental additions of feature categories (ordered roughly by computational cost and algorithmic complexity) lead to consistent performance increases across all desired recalls.

Multi-scale history summary modes

It has been reported in the literature that variability or trends of individual metrics are predictive of ICH events. Using different history buffer summary functions, we tested whether such features are indeed valuable vs. location estimates. Our results, listed in the second part of Table IV, suggest that adding trend/variability functions to the multi-scale history provides benefits.
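To make the distinction between location, trend and variability summaries concrete, the sketch below computes all three over several trailing windows of a per-minute metric. The scale lengths, the choice of mean, least-squares slope and standard deviation, and the feature names are assumptions for illustration; the summary functions actually used in the framework may differ.

```python
import numpy as np

def summarize_history(metric, scales=(5, 30, 60)):
    """Location, trend and variability of a per-minute metric over trailing windows.

    metric: 1-D sequence with one value per minute, most recent value last.
    scales: trailing window lengths in minutes (hypothetical choices).
    """
    metric = np.asarray(metric, dtype=float)
    feats = {}
    for w in scales:
        window = metric[-w:]
        t = np.arange(len(window), dtype=float)
        valid = ~np.isnan(window)
        if valid.sum() < 2:
            # Not enough data in this window to estimate a trend or a spread.
            feats[f"mean_{w}"] = feats[f"slope_{w}"] = feats[f"std_{w}"] = float("nan")
            continue
        x, tt = window[valid], t[valid]
        feats[f"mean_{w}"] = float(x.mean())                   # location estimate
        feats[f"slope_{w}"] = float(np.polyfit(tt, x, 1)[0])   # linear trend per minute
        feats[f"std_{w}"] = float(x.std(ddof=1))               # variability
    return feats

# Toy example: a metric that rises by one unit per minute for an hour.
features = summarize_history(np.arange(60.0))
```

Fitting the slope against the original minute indices keeps the trend estimate meaningful when some minutes are missing.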

How much history do we need to store?

Given the benefits of complex waveform features, it is still unclear whether informative changes in pre-hypertensive patterns occur only in the short-term history before the event or also in the long-term history. We addressed this question by ablating the set of multi-scale summary functions supported by our framework. Our results (Fig. 4) show no clear saturation when averages/trends/variability over additional length scales are added, up to a history length of 6 hours, at which point performance saturates.

Comparison of proposed model with baselines

In a last step, we evaluated different machine learning methods fitted on the optimal features and compared them with two baselines from the literature in Tables V and VI. We simulated the method of Hu et al. [20] (BL1: ICP morphology) by computing medians of ICP pulse morphology metrics in the last 15 and 30 minutes, which is similar to their pre-hypertensive segment features. A second baseline implements the recently proposed method by Myers et al. [21] (BL2: Last 2 points + Time to last crisis), which uses as its three features the last 2 ICP values in a 30 minute window and the time since the last ICH event. If there was no such event, the last feature was set to a large symbolic value. Results show that the simplest machine learning model, i.e. LogReg, performed best, while more complicated models like neural networks (MLP) or tree-based methods (Tree/GradBoost) provided no improvements. N/A is shown in Table V if the recall could not be achieved by a machine learning method in all splits. Our proposed model significantly outperformed the two baselines, both for PR-based and ROC-based metrics that emphasize high sensitivities.
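As an illustration of how small the BL2 feature set is, the following sketch builds the three features from per-minute data. It is one possible reading of the description above, not the implementation used in the paper or in [21]: the function name, the per-minute input format and the sentinel value `NO_EVENT_SENTINEL` are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for the "large symbolic value"; the source gives no number.
NO_EVENT_SENTINEL = 1e6

def bl2_features(icp_minutes, ich_flags, window=30):
    """Three BL2-style features: last two ICP values and minutes since the last ICH.

    icp_minutes: per-minute ICP values up to the prediction time (NaN if missing).
    ich_flags:   per-minute booleans, True where the patient was in an ICH state.
    """
    recent = np.asarray(icp_minutes, dtype=float)[-window:]
    valid = recent[~np.isnan(recent)][-2:]
    last_two = np.full(2, np.nan)
    last_two[2 - len(valid):] = valid  # left-pad with NaN if fewer than 2 values

    flags = np.asarray(ich_flags, dtype=bool)
    event_idx = np.flatnonzero(flags)
    if event_idx.size:
        minutes_since = float(len(flags) - 1 - event_idx[-1])
    else:
        minutes_since = NO_EVENT_SENTINEL
    return np.array([last_two[0], last_two[1], minutes_since])

# Toy example: an ICH minute two minutes before the prediction time.
example = bl2_features([10.0, 12.0, 15.0, 18.0], [False, True, False, False])
```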

Ranking of most important physiological metrics

By computing mean absolute SHAP values on the validation set in all 10 splits, we obtained a joint ranking of the importance of individual physiological metrics. The 20 most important features for predicting ICH are listed in Table VII. Features that have identical signatures but are computed over distinct scales are not repeated. Instead, scales that belong to the top 100 features overall are listed in the last column. Features among the top 100 for special feature categories are listed in Tables IX (wICP/wABP waveform) and X (auto-regulation indices).

As a complementary analysis to feature ablation, we also analyzed which feature categories provided important features according to rankings of mean absolute SHAP values. To enable an easier comparison, we computed the fraction of actual inclusions in the top 100 features (per split) over the number of theoretically possible inclusions. Results are summarized in Table VIII. Waveforms contribute more to highly ranked features than time series, both in absolute and relative terms. In addition, several important features are auto-regulation indices, spectral energies or morphological summaries. Finally, long-scale history summaries between 4 and 8 hours also provide many highly ranked features.
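The normalized inclusion fraction can be sketched as below. The exact normalization is not spelled out in the text, so this sketch assumes that the number of theoretically possible inclusions per split is the smaller of the category size and the size of the top list; the function and variable names are hypothetical.

```python
from collections import Counter

def category_inclusion_fraction(top_features_per_split, category_of, k=100):
    """Fraction of actual over possible top-k inclusions for each feature category.

    top_features_per_split: one ranked list of feature names per split.
    category_of:            mapping from every feature name to its category.
    Assumption: a category can contribute at most min(category size, k) per split.
    """
    sizes = Counter(category_of.values())
    hits = Counter()
    for ranked in top_features_per_split:
        for name in ranked[:k]:
            hits[category_of[name]] += 1
    n_splits = len(top_features_per_split)
    return {c: hits[c] / (n_splits * min(sizes[c], k)) for c in sizes}

# Toy example with two categories, two splits and a top-2 list (made-up names).
categories = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "b3": "B", "b4": "B"}
fractions = category_inclusion_fraction([["a1", "b1"], ["a1", "a2"]], categories, k=2)
```

Normalizing in this way makes small and large feature categories comparable, which raw counts of top-ranked features would not.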

Alarm timeliness before events

The performance of the proposed model for a precision of 35% is shown in Fig. 5. To provide more insights into the behavior of a derived alarm system in clinical settings, we have analyzed the recall of desired alarms before events, conditional on the time until the ICH phase starts. This measures the timeliness of alarms given a fixed model with a constant overall false alarm rate, which is a realistic scenario of clinical implementation. We can observe a modest decay of alarm recall rates in the proposed model, which stay above 70% even 8 hours prior to the event.
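The timeliness analysis amounts to computing recall separately for positive samples grouped by their time until the ICH phase starts, with one fixed alarm threshold. A minimal sketch under these assumptions follows; the one-hour bins, the function name and the input format are illustrative choices, not taken from the paper.

```python
import numpy as np

def recall_by_time_to_event(scores, hours_to_event, threshold,
                            bin_edges=(0, 1, 2, 3, 4, 5, 6, 7, 8)):
    """Alarm recall of positive samples, binned by hours until ICH onset.

    scores:         prediction scores of positive samples only.
    hours_to_event: time from each sample to the start of the ICH phase.
    threshold:      fixed score threshold that defines an alarm.
    """
    scores = np.asarray(scores, dtype=float)
    hours = np.asarray(hours_to_event, dtype=float)
    alarms = scores >= threshold
    recall = {}
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (hours >= lo) & (hours < hi)
        # Recall is undefined for bins that contain no positive samples.
        recall[(float(lo), float(hi))] = (
            float(alarms[in_bin].mean()) if in_bin.any() else float("nan")
        )
    return recall

# Toy example: two samples shortly before onset, two samples 7.5 hours before.
timeliness = recall_by_time_to_event(
    scores=[0.9, 0.2, 0.8, 0.1],
    hours_to_event=[0.5, 0.5, 7.5, 7.5],
    threshold=0.5,
)
```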

V Discussion

We have designed and evaluated a prediction framework for acute intracranial hypertension events, which describes the neurological state using multi-scale descriptors of cerebral autoregulation indices, pulse morphology metrics, spectral energies and statistical summaries. Alarms before critical events were retrieved up to 8 hours before the onset of ICH (Figure 5) with an overall recall of 90 % at a precision of 30.3 %. By mainly analyzing the system using recall/precision we have chosen metrics that more easily translate to the clinical deployment of alarm systems than ROC-based metrics. The achieved AUROC score of 0.771 for 8h forecasting is comparable to the work of Myers et al. [21], which is to our knowledge the only published work with a similar forecasting horizon of 6 hours. Yet, direct comparison is not easily possible due to differences in cohort size (123 vs. 817 segments) and label definitions. In this work, we implemented the method in [21] to enable a controlled comparison on the same dataset, see Table VI.

Limitations of our study include the inability to match all 123 recording segments to the MIMIC-III clinical database, and hence take clinical context information into account for label and experimental split definition. With respect to labels, by not taking clinical interventions into account, we suspect that we are predicting cases of acute intracranial hypertension that care providers did not anticipate in time before the patient entered the ICH state, which corresponds to scenarios where an early warning system is used as a complementary decision support tool. To analyze this effect in detail, prospectively collected data which documents all clinical interventions and provides contextual information on the ICH events would be required. In our data splits, training, validation and test sets are guaranteed to be temporally disjoint, without interaction between labels and clinical features. Focusing on patients that have comprehensive monitoring with few missing values in the data-set could bias the cohort towards more intensively monitored patients. We chose to apply this criterion regardless because it allows more meaningful conclusions about the utility of different feature channels for predictive modeling.

A novel perspective on the design of ICH alarm systems is provided by the results in Table III. While building a model just using averaged time series defining the event status (ICP) provides a good baseline performance, the inclusion of richer data modalities like waveforms substantially increased performance. Including high-frequency context information (wABP) in addition to wICP increased the performance further. This hints at new independent information in the ABP waveform and supports the importance of auto-regulation indices, which are functions of two waveforms simultaneously. However, data storage and computational cost associated with waveform data might be considerable.

To our knowledge, there is no previous work that assesses the relative merits of different data modalities for ICH prediction. It is an interesting finding that only using waveforms performed better than only averaged time series, especially in light of recent related works, which found high performance using very simple models, e.g. using only minute-by-minute summaries of ICP.

Each individual feature category among auto-regulation indices, spectral energy, pulse morphology metrics provided marginal performance gains (Table IV) and is relevant for explaining predictions, as assessed using SHAP values (Table VIII). This shows that complex pre-hypertensive patterns previously identified can translate into relevant machine learning features in our framework.

Our results (Figure 4) show a clear trend between the length of considered history and prediction performance, which confirms the design principle of the multi-scale history, and also hints at the relative importance of long-scale physiological changes before hypertensive events, which could inform clinical studies. The same observation can be derived from the analysis of feature category importance in Table VIII, where a clear trend in importance from short-term to long-term features is visible.

The comparison of machine learning models (Table V) provides a pragmatic look at the relevance of the exact statistical learning method for predicting ICH. As has also been observed for other prediction problems in health care, simple models perform surprisingly well and are not outperformed by models with higher complexity. We suspect that, since the feature choices already incorporate extensive domain knowledge, a simple model like logistic regression is powerful enough. Given the similar performance of MLP and LogReg, we did not consider the construction of more complex architectures like RNNs or CNNs for this study. Such architectures are harder to interpret than classical models, and interpretation of model features is an important focus of this study.

VI Conclusion

We presented an online machine learning and signal processing framework that forecasts onsets of acute intracranial hypertension up to 8 hours in advance. Using an extensive series of ablation studies we have shed light on the critical components of the framework. SHAP value analysis provided a second perspective on the importance of different feature categories in explaining predictions as well as a ranking of discriminative pattern changes before acute intracranial hypertension. Both perspectives highlight the importance of information derived from waveforms, which provided a substantial performance increase. Our method out-performed two baselines from the literature, which use ICP pulse morphology and 3 simple features of the ICP time series, respectively.

Directions of future work include more sophisticated artifact detection methods at the block level, to minimize the corruption of down-stream feature generation, which is highly sensitive to accurate input signals. Exploring the per-sample SHAP values could provide interpretable reasons for predictions of future ICH events, visualize regions of interest in the history, flag abnormal physiological indices that could precede ICH, and generate hypotheses for future studies on the phenomenon. Furthermore, our method could be extended by providing a calibrated alarm system on top of the prediction scores, which triggers alarms at the bedside as a function of sequences of prediction scores. This would be an important step towards clinical implementation of our proposed approach, which was conceived as a real-time algorithm that could directly use data streamed from sensors. Collecting prospective clinical data to refine the labeling using information on clinical interventions, as well as adding clinical co-variates like diagnosis or clinical note concepts, could, we suspect, increase the performance of our model even further. Finally, we expect that our framework could be applied to predict other critical events occurring in the injured brain.

Acknowledgments

This work was supported by the Gebert-Rüf Stiftung, Switzerland, under grant agreement GRS-025/14 and the Swiss National Science Foundation, under grant agreement 150640. MH was partially funded by the Grant No. 205321_176005 “Novel Machine Learning Approaches for Data from the Intensive Care Unit” of the Swiss National Science Foundation (to Gunnar Rätsch). We acknowledge Emanuela Keller, head of the Neurocritical Care Unit at the University Hospital Zürich, Switzerland, for providing indispensable clinical insights and motivation for this work. MH gratefully acknowledges helpful discussions with Panagiotis Farantatos, Viktor Gal, Stephanie Hyland, Xinrui Lyu, Ngoc M. Pham and Gunnar Rätsch.

Bibliography

[1] J. A. Langlois et al., “The epidemiology and impact of traumatic brain injury: a brief overview,” J. Head Trauma Rehabil., vol. 21, no. 5, pp. 375–378, 2006.
[2] H. M. Bramlett and W. D. Dietrich, “Pathophysiology of cerebral ischemia and brain trauma: similarities and differences,” J. Cereb. Blood Flow Metab., vol. 24, no. 2, pp. 133–150, Feb. 2004.
[3] M. Oddo et al., “Brain hypoxia is associated with short-term outcome after severe traumatic brain injury independently of intracranial hypertension and low cerebral perfusion pressure,” Neurosurgery, vol. 69, no. 5, pp. 1037–1045, Nov. 2011. [Online]. Available: http://dx.doi.org/10.1227/NEU.0b013e3182287ca7
[4] T. Rehman et al., “Rapid progression of traumatic bifrontal contusions to transtentorial herniation: A case report,” Cases J., vol. 1, no. 1, p. 203, Oct. 2008. [Online]. Available: http://dx.doi.org/10.1186/1757-1626-1-203
[5] C. Werner and K. Engelhard, “Pathophysiology of traumatic brain injury,” Br. J. Anaesth., vol. 99, no. 1, pp. 4–9, Jul. 2007. [Online]. Available: http://dx.doi.org/10.1093/bja/aem131
[6] A. Lavinio and D. K. Menon, “Intracranial pressure: why we monitor it, how to monitor it, what to do with the number and what’s the future?” Current Opinion in Anesthesiology, vol. 24, no. 2, pp. 117–123, 2011.
[7] N. Carney et al., “Guidelines for the management of severe traumatic brain injury, 4th edition,” Neurosurgery, vol. 80, no. 1, pp. 6–15, Jan. 2017. [Online]. Available: https://academic.oup.com/neurosurgery/article-abstract/80/1/6/2585042
[8] L. Rangel-Castilla et al., “Cerebral pressure autoregulation in traumatic brain injury,” Neurosurg. Focus, vol. 25, no. 4, p. E7, Oct. 2008. [Online]. Available: http://dx.doi.org/10.3171/FOC.2008.25.10.E7