Personalized predictions of patient outcomes during and after hospitalization using artificial intelligence

  • Hannah
  • April 7, 2020

Study cohort

In the study period, there were 1,485,880 hospitalizations for 708,089 unique patients, 439,696 (62%) of whom had only 1 hospitalization recorded. The median number of hospitalizations per patient was 1 (first and third quartiles (QI) [1.0, 2.0]). There were 211,022 30-day readmissions, for an overall readmission rate of 14%. Among patients aged ≥65 years, the 30-day readmission rate was 16%. The median length of stay (LOS), including patients in observation status and labor and delivery patients, was 2.94 days (QI [1.67, 5.34]); with these patients excluded, it was 3.71 days (QI [2.15, 6.51]). The demographic and clinical characteristics of the patient cohort are summarized in Table 1. Higher rates of 30-day readmission were observed in patients who were older (median age 62 vs. 59 years), were African American (17% vs. 13% in whites), were divorced/separated or widowed (17% vs. 13% in married/partnered or single patients), had Medicare insurance (17% vs. 10% for private insurance), or had one or more chronic conditions such as cancer, renal disease, congestive heart failure, or chronic obstructive pulmonary disease (Table 1).
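The headline percentages in this paragraph can be reproduced from the reported counts. The short check below is only a sketch: it assumes the overall readmission rate uses all hospitalizations as its denominator, which the text does not state explicitly but which matches the reported 14%. The variable and function names are illustrative.

```python
# Counts reported in the study cohort paragraph.
hospitalizations = 1_485_880
unique_patients = 708_089
single_admission_patients = 439_696
readmissions_30d = 211_022


def share(part, whole):
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


# Share of patients with exactly one recorded hospitalization.
single_admission_pct = share(single_admission_patients, unique_patients)

# Assumption: the denominator of the readmission rate is all hospitalizations.
readmission_rate_pct = share(readmissions_30d, hospitalizations)

print(f"single-admission patients: {single_admission_pct:.1f}%")
print(f"30-day readmission rate:   {readmission_rate_pct:.1f}%")
```

Both values round to the figures quoted in the text (62% and 14%).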

Table 1 Characteristics of hospital encounters in the study sample, overall and according to readmission and extended length of stay.

Prediction of inpatient outcomes

Thirty-day readmissions were predicted with an area under the receiver operating characteristic curve (ROC AUC, here abbreviated as simply “AUC”) of 0.76 (Supplementary Fig. 1a). The Brier score loss (BSL) was 0.11; the calibration curve is shown in Supplementary Fig. 1b. Average precision was 0.38 (see Supplementary Fig. 2c). Other off-the-shelf ML models, including a deep neural network, were trained on the same task; their performance was generally inferior to that of the Gradient Boosting Machine (GBM) or, in the case of the deep neural network, similar (see Supplementary Fig. 2 and Supplementary Table 1). When trained and evaluated on a smaller cohort of 300,000 hospitalizations, performance metrics were similar: AUC 0.75, BSL 0.11. The most impactful features, ranked from most to least important, included primary diagnosis, days between the current admission and the previous discharge, number of past admissions, LOS, total emergency department visits in the past 6 months, number of reported comorbidities, admission source, discharge disposition, and Body Mass Index (BMI) on admission and discharge, as well as others (Fig. 1a, b, see also Supplementary Fig. 3). Including more than the top ten variables in the model did not improve predictive power for the cohort overall, but it did allow a more specific rationale for the prediction in certain patients and made it possible to examine feature interactions. Sample individualized predictions with their explanations are shown in Fig. 1c, d, and further examples are shown in Supplementary Fig. 4. The examples in Supplementary Fig. 4 show patients with comparable predicted probabilities but different compositions of features leading to these predictions.
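For readers less familiar with the three metrics reported here, the sketch below gives minimal pure-Python definitions of AUC, Brier score loss, and average precision and applies them to a small made-up example. It is not the study's evaluation code; the toy labels and probabilities are invented for illustration, and the average-precision function assumes there are no tied scores.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """Chance that a random positive is scored above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_prob):
    """Mean precision at the rank of each positive, highest score first.

    Assumes no tied scores.
    """
    ranked = sorted(zip(y_prob, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Invented toy data: 1 = readmitted within 30 days, 0 = not readmitted.
toy_labels = [0, 0, 1, 0, 1, 0, 0, 1]
toy_probs = [0.05, 0.10, 0.40, 0.20, 0.70, 0.15, 0.30, 0.25]

print("AUC:", roc_auc(toy_labels, toy_probs))
print("BSL:", brier_score(toy_labels, toy_probs))
print("AP: ", average_precision(toy_labels, toy_probs))
```

AUC and average precision depend only on how cases are ranked, whereas the Brier score also penalizes poorly calibrated probabilities, which is why the text reports a calibration curve alongside it.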

Fig. 1: 30-Day readmission.

a Shows the most impactful features on prediction (ranked from most to least important). b Shows the distribution of the impacts of each feature on the model output. The colors represent the feature values for numeric features: red for larger values and blue for smaller. The line is made of individual dots representing each admission, and the thickness of the line is determined by the number of examples at a given value (for example, most patients have a low number of past admissions). A negative SHAP value (extending to the left) indicates a reduced probability, while a positive one (extending to the right) indicates an increased probability. For non-numeric features, such as primary diagnosis, the gray points represent specific possible values, with certain diagnoses greatly increasing or reducing the model’s output, while the majority of diagnoses have relatively mild impact on prediction. c, d Show the composition of individualized predictions for two patients. The patient in c was admitted from the emergency outpatient unit with a headache and stayed for >7 days. In addition, this patient had been hospitalized 3 times prior to this admission and had been discharged from the last admission only 8 days prior. The predicted probability of 30-day readmission (~0.30) was three times the baseline value predicted by the model (~0.1). All of the listed features increased the model’s prediction of risk by the relative amounts shown by the size of the red bars. Conversely, the patient in d was admitted for a complete uterovaginal prolapse, stayed less than a full day, and had no reported comorbidities, such as hypertension, depression, or a history of cancer. The model predicted their probability of 30-day readmission at 0.03, or roughly one-third of the baseline prediction. Only the top contributing variables that fit on the chart are shown; the others can be queried in the live system. The model considers all variables, and SHAP reports on all of them internally, but the images are truncated for legibility.
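The individualized explanations in panels c and d rest on the additive property of SHAP values: the model output for one patient equals a baseline plus the sum of that patient's per-feature contributions. The sketch below illustrates this under two loudly stated assumptions: that contributions are expressed in log-odds units (a common setup for tree-based classifiers, but not confirmed by the text), and that the four contribution values are hypothetical numbers chosen only so that a baseline of about 0.1 ends near 0.30, as for the patient in c.

```python
import math


def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Approximate baseline probability quoted in the caption.
baseline_probability = 0.10

# HYPOTHETICAL contributions in log-odds units; these are not values
# from the paper, only an illustration of how the bars add up.
contributions = {
    "primary diagnosis": 0.45,
    "days since previous discharge": 0.40,
    "length of stay": 0.30,
    "number of past admissions": 0.20,
}


def explained_probability(baseline_p, contribs):
    """Baseline log-odds plus the summed contributions, as a probability."""
    return sigmoid(logit(baseline_p) + sum(contribs.values()))


predicted = explained_probability(baseline_probability, contributions)
print(f"baseline {baseline_probability:.2f} -> predicted {predicted:.2f}")
```

With no contributions the function returns the baseline itself, which is the sense in which each bar in the figure moves the prediction away from the population-level value.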


To examine possible changes in the causes of readmission risk as a function of time from discharge, we predicted readmission risk for several readmission thresholds and calculated SHAP (SHapley Additive exPlanation) values for each. SHAP values for 3- and 7-day readmission are shown in Supplementary Fig. 5a, b, respectively. For example, 7-day readmission risk prediction achieved an AUC of 0.70 with a BSL of 0.05 (Table 2). The most impactful feature remained primary diagnosis, but other features played more important roles: BlockGroup rose to second most important variable (from ninth), number of emergency department visits in the past 6 months rose to third from fourth, admission blood counts increased in importance, and insurance provider rose to eighth from twelfth. BMI on admission fell several places, and BMI on discharge no longer featured among the top variables. The BMI variables are unique in that missing values, in addition to extreme values, tend to be important, perhaps correlating with disease burden and/or hospital practices that could be further investigated.
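Predicting readmission at several thresholds amounts to deriving one binary label per threshold from the same underlying quantity, the time from discharge to the next admission. A minimal sketch of that labeling step follows; the function name, the use of `None` for patients with no subsequent admission, and the inclusive comparison (`<=`) are assumptions made for illustration rather than details given in the text.

```python
from typing import Dict, Optional

# Thresholds mentioned in the text: 3-, 7-, and 30-day readmission.
THRESHOLDS_DAYS = (3, 7, 30)


def readmission_labels(days_to_next_admission: Optional[float]) -> Dict[int, int]:
    """Return a 0/1 label per threshold for a single discharge.

    `None` means no later admission was recorded (assumed convention).
    """
    return {
        threshold: int(
            days_to_next_admission is not None
            and days_to_next_admission <= threshold
        )
        for threshold in THRESHOLDS_DAYS
    }


print(readmission_labels(5))     # readmitted on day 5
print(readmission_labels(None))  # never readmitted
```

Because shorter thresholds yield fewer positive cases, each threshold defines a separate and increasingly imbalanced classification task, which is worth keeping in mind when comparing AUC and BSL across thresholds.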

Table 2 Performance of predictive models.

LOS was predicted both as a number of days and as a binary outcome at various thresholds. LOS in days was predicted poorly, with a root mean square error (RMSE) of 3.97 days against a median LOS of 2.94–3.71 days. LOS over 5 days was predicted with an AUC of 0.84 (Fig. 2a) and a BSL of 0.15 (calibration curve shown in Supplementary Fig. 1d). Average precision was 0.70 (see Supplementary Fig. 2d). When trained and evaluated on a cohort of 300,000 patients, performance was similar: AUC 0.81 and BSL 0.17. Other ML models, including a deep neural network, were trained on the same task, with performance generally inferior to the GBM (see Supplementary Fig. 2 and Supplementary Table 1). The most impactful features included the type of admission, primary diagnosis code, patient age, admission source, LOS of the most recent prior admission, medications administered in the hospital in the first 24 h, insurance, and early admission to the intensive care unit, among others shown in Fig. 2c, d. Impactful features for LOS at thresholds of 3 and 7 days are shown in Supplementary Fig. 5c, d, respectively. The AUC at these thresholds did not differ from that at 5 days (Table 2). Given that primary diagnosis is often assigned late in the hospital encounter or even after discharge, we trained the LOS models with and without this feature for comparison. Results are shown in Supplementary Table 1d. Without primary diagnosis, predictive performance decreased, as expected: AUC for LOS > 5 days was 0.781, BSL was 0.173, and average precision was 0.640.
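The two formulations of the LOS task described above, a regression scored by RMSE and a binary classification at a day threshold, can be sketched as follows. The toy stays and predictions are invented; only the RMSE of 3.97 days and the median LOS of 2.94 days in the last line come from the text.

```python
import math


def rmse(actual_days, predicted_days):
    """Root mean square error between actual and predicted stays."""
    squared = [(a - p) ** 2 for a, p in zip(actual_days, predicted_days)]
    return math.sqrt(sum(squared) / len(squared))


def exceeds_threshold(los_days, threshold_days=5):
    """Binarize stays: 1 if strictly longer than the threshold, else 0."""
    return [int(d > threshold_days) for d in los_days]


# Invented toy stays (days) and regression predictions.
actual = [1.0, 3.0, 6.0, 10.0]
predicted = [2.0, 3.0, 4.0, 6.0]

toy_rmse = rmse(actual, predicted)
labels_over_5 = exceeds_threshold(actual, 5)

# Reported values: the error is larger than a typical stay.
rmse_to_median = 3.97 / 2.94

print(f"toy RMSE: {toy_rmse:.2f} days, labels: {labels_over_5}")
print(f"reported RMSE / median LOS: {rmse_to_median:.2f}")
```

An error larger than the median stay itself helps explain why the regression is described as poor, while the thresholded version discards the long right tail of stays that dominates a squared-error metric.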

Fig. 2: Length of stay >5 days.

a shows the most impactful features on prediction (ranked from most to least important). b shows the distribution of the impacts of each feature on the model output. The colors represent the feature values for numeric features: red for larger values and blue for smaller. The line is made of individual dots representing each admission, and the thickness of the line is determined by the number of examples at a given value (for example, many of our patients are elderly). A negative SHAP value (extending to the left) indicates a reduced probability, while a positive one (extending to the right) indicates an increased probability. For example, advanced age increases the probability of extended length of stay (SHAP value between zero and one), while young age tends toward a SHAP value between roughly −1 and zero, corresponding to reduced probability. For non-numeric features, such as primary diagnosis, the gray points represent specific possible values, with certain diagnoses greatly increasing or reducing the model’s output, while the majority of diagnoses have relatively mild impact on prediction. c, d show the composition of individualized predictions for two patients. The 75-year-old patient in c was admitted to the inpatient service directly from a physician’s office with leakage of a heart valve graft. The patient received 32 medications in the first 24 h and had Medicare Part A insurance coverage. The model predicted that the patient’s probability of staying >5 days was 0.80, nearly four times the baseline prediction of ~0.2. The majority of the model’s prediction was based on the diagnosis, followed by the number of initial medications, and then the other variables as shown. The patient in d, on the other hand, had a predicted probability of a length of stay >5 days of 0.06, or roughly one-fourth of the baseline, despite being admitted to the ICU within 24 h of admission. The major contributor to this low probability was the diagnosis of antidepressant poisoning, followed by a private insurance provider, and finally by a lack of BMI recorded in the chart for this encounter. The reason a missing BMI value is important is unclear, but the effect appears repeatedly across analyses and may reflect systematic recording practices within the hospital system (see Agniel et al. (ref. 19) for an exploration of this phenomenon).


Death within 48–72 h of admission was predicted with an AUC of 0.91 and BSL of 0.001 (Table 2). However, owing to extreme class imbalance (e.g., in the testing set there were 260,518 non-deaths and 390 deaths), this was achieved by predicting non-death in every case. Strategies to produce a reliable model by addressing class imbalance, such as data oversampling, were unsuccessful. With so few positive cases, AUC and BSL do not reliably indicate model performance or applicability in this clinical setting.
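The arithmetic behind this caveat is worth making explicit. Using only the test-set counts quoted above, the sketch below computes what a trivial predictor that assigns every patient a death probability of zero would score. It is an illustration of the imbalance, not a reconstruction of the study's model.

```python
# Test-set counts quoted in the text.
non_deaths = 260_518
deaths = 390
total = non_deaths + deaths

# Fraction of encounters that ended in death.
prevalence = deaths / total

# A predictor that always outputs probability 0 (non-death):
# each death contributes (0 - 1)^2 = 1, each non-death (0 - 0)^2 = 0.
brier_all_zero = (deaths * (0 - 1) ** 2 + non_deaths * (0 - 0) ** 2) / total

# It is correct on every non-death and wrong on every death.
accuracy_all_negative = non_deaths / total
sensitivity_all_negative = 0 / deaths

print(f"prevalence:            {prevalence:.5f}")
print(f"Brier score (all 0):   {brier_all_zero:.5f}")
print(f"accuracy (all neg.):   {accuracy_all_negative:.5f}")
print(f"sensitivity (all neg.): {sensitivity_all_negative:.1f}")
```

The trivial predictor's Brier score equals the prevalence, roughly 0.0015, which is of the same order as the reported BSL of 0.001, while it identifies none of the deaths. A near-zero Brier score therefore says little here, and metrics focused on the positive class would be more informative.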

Variable interactions

SHAP analysis also allows examination of interactions between variables. Key variable interactions are shown in Supplementary Figs. 6 and 7. For example, high and low heart rates affected the probability of readmission differently for patients of different ages. In older patients, there is a clearer trend toward lower heart rates on discharge contributing to lower readmission risk and higher heart rates contributing to higher readmission risk, though the effect is modest (SHAP values from −0.1 to +0.1–0.2). In younger patients, discharge heart rates are higher overall, and the positive trend is weaker. This may highlight the importance of considering a variable such as heart rate in a more complete clinical context, one that includes patient age and clinical reasoning (e.g., an adult is unlikely to be discharged with marked tachycardia) (Supplementary Fig. 6c). A similar finding is observed in Supplementary Fig. 7c for LOS prediction, though clinical reasoning is less likely to play a role than purely physiologic phenomena: heart rates are higher overall in pediatric patients, and the relationship between heart rate and LOS is less linear for them (high and low SHAP values are observed more uniformly for given levels of tachycardia in pediatric patients).