by Shreyas Kar
This project could help researchers develop prediction models that impact the healthcare of thousands of people.
Electronic Health Records (EHRs) refer to the systematic collection of patient data in an electronic format. With advances in technology, EHR data has become a staple of modern medicine, with more than 95% of hospitals using EHRs, improving the quality of health care and bettering patient outcomes.
Increasingly, EHR data has been used as a common data source in prediction modeling applications, which have wide-ranging implications for everyday clinical practice. One popular application is clinical decision support, which entails developing prediction models to inform medical decisions, for example in medical diagnosis. Another application is Comparative Effectiveness Research (CER), the comparison of the benefits and risks of two drugs. The prediction modeling piece here is that a model is typically used to stratify patients into risk groups, which is often a valuable part of CER.
However, one drawback to using EHR data is that most EHR systems do not comprehensively capture clinical encounters across all healthcare facilities for a particular person. Thus, the EHR record of a particular person may omit clinically important information. This gap, or discontinuity, in EHR data can affect the validity of EHR-based predictive models, potentially causing information bias in many EHR applications.
The most significant study on this topic was by Lin et al., who determined that EHR-discontinuity may lead to bias in CER studies. However, the vast majority of CER studies still use EHR data as a whole without accounting for EHR-continuity, as no study in the literature has quantified the impact of EHR-discontinuity in CER. In addition, there is currently no data on the impact of EHR-discontinuity on prediction model development in particular.
To address this gap in the literature, we seek to assess and quantify how EHR-discontinuity affects model performance in EHR-based predictive modeling.
Specifically, I design machine learning (ML) prediction models to predict outcomes of clinical importance. I then compare model performance when ascertaining predictors and outcomes from EHR data alone versus from EHR plus insurance claims data, the latter of which has less discontinuity because claims capture encounters across healthcare facilities. In addition, I stratify patients into high and low EHR-continuity groups as defined by a published algorithm.
To develop (train) the models, I use EHR System 1, which consists of data from one tertiary hospital, two community hospitals, and 17 primary care facilities; for validation (testing), I use EHR System 2, which consists of one tertiary hospital, one community center, and 16 primary care facilities. In both EHR systems, patients are followed from January 1, 2007 until the end of the study period, the end of their Medicare coverage, or death, whichever comes first. The dataset was obtained from the Centers for Medicare & Medicaid Services through an NIH grant. Both EHR databases contain information on patient demographics, medical diagnoses, and medications.
Most applications of prediction modeling using EHR data focus not on predicting outcomes in the general population, but rather on a subset of the population with specific comorbidities. Thus, I trained the prediction models in a cohort of patients with comorbidities as defined by a standardized protocol.
The clinical information (predictors) captured from EHR and EHR+claims data was assessed during the baseline period. In each data source, I removed duplicate predictors and assessed 128 covariates as defined by an established protocol, spanning demographic factors, comorbidities, prior medications, and healthcare variables. Three outcomes of clinical and public health importance were investigated: mortality, a composite cardiovascular outcome, and hemorrhage.
I develop five models for each of the three outcomes: logistic regression, lasso, gradient boosting, random forest, and Bayesian Additive Regression Trees (BART), chosen because they are highly diverse model classes. For each model, I perform hyperparameter optimization, which refers to choosing the optimal values of the parameters that govern the training process, using grid search with cross validation. In this method, the optimal values are chosen from a grid of pre-specified hyperparameter values, with performance measured through k-fold cross validation: the training set is split into k folds, and in each iteration one fold is held out for testing while the remaining k − 1 folds are used for training. The average performance across the k folds serves as the performance metric, and the hyperparameter combination with the best score is chosen.
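For concreteness, a minimal sketch of this tuning procedure is shown below, using scikit-learn's GridSearchCV with a gradient boosting classifier on synthetic data; the hyperparameter grid and the synthetic cohort are illustrative placeholders rather than the settings used in this study.

```python
# Minimal sketch of grid-search hyperparameter tuning with 5-fold cross
# validation, using scikit-learn on synthetic data. The grid values and the
# synthetic cohort are illustrative placeholders, not the study's settings.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced binary outcome, standing in for a clinical cohort
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.01, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",   # AUC as the model-selection metric
    cv=5,                # 5-fold cross validation
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)           # best hyperparameter combination
print(round(search.best_score_, 3))  # mean cross-validated AUC of that choice
```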
In addition to stratifying on data completeness, prediction model type, and data source, I used different techniques to process the predictors prior to model development. The first technique is the Synthetic Minority Oversampling Technique (SMOTE), a state-of-the-art technique for handling imbalanced datasets. Second, I reduce the dimensionality of the original 128 predictors using Factor Analysis of Mixed Data (FAMD). Finally, I also test the models with no preprocessing.
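The sketch below illustrates one way these preprocessing steps can be applied, assuming the imbalanced-learn package for SMOTE and the prince package for FAMD; the library choices, toy data, and component counts are assumptions for illustration, not the study's configuration.

```python
# Sketch of the two preprocessing options, assuming the imbalanced-learn
# (SMOTE) and prince (FAMD) packages; the toy data below are placeholders.
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
import prince

# --- SMOTE: oversample the minority class of an imbalanced training set ---
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(sum(y), sum(y_res))  # minority-class count before and after oversampling

# --- FAMD: reduce mixed numeric/categorical predictors to few components ---
df = pd.DataFrame({
    "age":      [67, 72, 80, 75, 69, 81, 77, 70],
    "n_meds":   [5, 2, 8, 3, 6, 4, 7, 1],
    "sex":      ["F", "M", "F", "M", "F", "M", "F", "M"],
    "prior_mi": ["yes", "no", "no", "yes", "no", "yes", "yes", "no"],
})
famd = prince.FAMD(n_components=2, random_state=0)
components = famd.fit(df).transform(df)  # low-dimensional coordinates per row
print(components.head())
```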
I use two primary performance metrics that are commonly used in this field: the Area Under the Receiver Operating Characteristic Curve (AUC), where the ROC curve plots the true positive rate as a function of the false positive rate as the classification threshold is varied, and log-loss, which penalizes a prediction increasingly heavily the farther the predicted probability is from the actual outcome.
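The short example below computes both metrics with scikit-learn on made-up labels and predicted probabilities; the numbers are purely illustrative.

```python
# Computing the two evaluation metrics with scikit-learn; the labels and
# predicted probabilities below are made-up illustrative values.
from sklearn.metrics import log_loss, roc_auc_score

y_true = [0, 0, 1, 0, 1, 1, 0, 1]                          # actual outcomes
y_prob = [0.10, 0.35, 0.70, 0.20, 0.90, 0.55, 0.40, 0.80]  # model probabilities

print(f"AUC      = {roc_auc_score(y_true, y_prob):.3f}")
print(f"log-loss = {log_loss(y_true, y_prob):.3f}")
```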
The table shown below depicts the prediction model and preprocessing technique with the highest AUC in the training set in each stratum, where each stratum is defined by a data completeness level, data source, and outcome. The model with the highest AUC in each stratum was used for assessment of the impact of EHR continuity and data source. I found that SMOTE outperformed the other preprocessing techniques in the high continuity cohort in the EHR database. Further, no preprocessing outperformed the preprocessing techniques in the EHR+Claims database.
LR: Logistic Regression; RF: Random Forest; GB: Gradient Boosting; NP: No Preprocessing; SMOTE: Synthetic Minority Oversampling Technique
Table 2 shows the testing-set AUC of the EHR-based models across different levels of data continuity. The model used for each continuity-outcome pair was the one with the best performance in that pair, as depicted in Figure 1. The high continuity cohort (AUC = 0.7517) had a 5.73% higher AUC than the general population cohort (AUC = 0.7153) and a 10.31% higher AUC than the low continuity cohort (AUC = 0.6857) across the three clinically relevant outcomes. In addition, the low continuity cohort had a 9.06% lower AUC than the high continuity cohort and a 4.27% lower AUC than the general population cohort.
The figure below depicts the Receiver Operating Characteristic (ROC) curves, which plot the true positive rate (TPR) against the false positive rate (FPR) as the classification threshold is varied, for each set of models by outcome, with and without k-fold cross validation. For each outcome, the high continuity curve lies above the general population curve at all classification thresholds, and the low continuity curve lies below the general population curve at all thresholds. Thus, regardless of the threshold level, the model restricted to the high continuity cohort performs better than the model in the general population cohort, which in turn performs better than the model restricted to the low continuity cohort.
Receiver Operating Characteristic (ROC) Curve without (first set) and with (second set) 5-fold cross validation
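As a sketch of how such curves can be produced, the code below fits a simple logistic regression on synthetic data and plots its ROC curve with scikit-learn and matplotlib; the model and data are stand-ins for the study's cohort-specific models.

```python
# Sketch of plotting an ROC curve (TPR vs. FPR across thresholds) with
# scikit-learn and matplotlib; the logistic regression on synthetic data is a
# stand-in for the study's cohort-specific models.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]      # predicted probabilities

fpr, tpr, _ = roc_curve(y_te, probs)         # one (FPR, TPR) pair per threshold
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_te, probs):.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey", label="Chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```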
Table 3 demonstrates similar trends with the log-loss metric. The low continuity cohort (log-loss = 0.2871) had a 44.27% higher log-loss than the high continuity cohort (log-loss = 0.1990), and the general population cohort (log-loss = 0.2431) had a 22.16% higher log-loss than the high continuity cohort. Equivalently, the high continuity cohort's log-loss was 30.67% lower than the low continuity cohort's, and the general population cohort's log-loss was 15.3% lower than the low continuity cohort's.
Log-Loss across outcomes and continuity levels for models trained in EHR data
Using DeLong's test, a statistical test for comparing the AUCs of two models, I found that the models trained on the same outcome in differing population cohorts all had significantly different AUCs (p << 0.01). Table 4 summarizes the specific p-values obtained from each model comparison.
P-values for model comparisons using DeLong's test
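For reference, a compact sketch of the paired DeLong test is given below; it assumes the two models' predicted scores are available for the same test set, and it is not the code used in this study.

```python
# Compact sketch of the paired DeLong test (DeLong et al., 1988) for comparing
# the AUCs of two models scored on the same test set; not the study's code.
import numpy as np
from scipy.stats import norm

def delong_test(y_true, scores_a, scores_b):
    """Return (AUC_a, AUC_b, two-sided p-value) for the AUC difference."""
    y_true = np.asarray(y_true)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    pos, neg = y_true == 1, y_true == 0
    m, n = pos.sum(), neg.sum()

    def components(scores):
        # psi = 1 if a positive is scored above a negative, 0.5 on ties, else 0
        psi = (scores[pos][:, None] > scores[neg][None, :]).astype(float) \
            + 0.5 * (scores[pos][:, None] == scores[neg][None, :])
        return psi.mean(), psi.mean(axis=1), psi.mean(axis=0)

    auc_a, v10_a, v01_a = components(scores_a)
    auc_b, v10_b, v01_b = components(scores_b)

    # Covariance of the two (correlated) AUC estimates
    s10, s01 = np.cov([v10_a, v10_b]), np.cov([v01_a, v01_b])
    var = (s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m \
        + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n

    z = (auc_a - auc_b) / np.sqrt(var)
    return auc_a, auc_b, 2 * norm.sf(abs(z))

# Hypothetical usage: auc_a, auc_b, p = delong_test(y_test, probs_a, probs_b)
```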
AUC and Log-Loss values for best-performing models trained in EHR + Claims Data
In addition to EHR data alone, machine learning models were developed in EHR data supplemented with insurance claims data. This serves as the "maximum potential" the models can reach, as EHR + claims represents the "full data." The figures below show the 5-fold cross-validated testing AUC and log-loss values generated by training the 12 models on the EHR + claims databases. By incorporating claims data, there is significantly less variability in performance across data completeness levels compared to training the models on EHR-only data.
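A minimal sketch of how such 5-fold cross-validated AUC and log-loss values can be obtained with scikit-learn is shown below; the random forest model and synthetic data are placeholders, not the study's models or cohort.

```python
# Sketch of obtaining 5-fold cross-validated AUC and log-loss with
# scikit-learn; the random forest and synthetic data are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0)

auc_folds = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
ll_folds = -cross_val_score(model, X, y, cv=5, scoring="neg_log_loss")

print(f"AUC:      {auc_folds.mean():.3f} +/- {auc_folds.std():.3f}")
print(f"log-loss: {ll_folds.mean():.3f} +/- {ll_folds.std():.3f}")
```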