Introduction

Peripheral artery disease (PAD) afflicts 8 to 12 million Americans, incurring up to US$21 billion in annual healthcare costs1,2,3. Making the diagnosis entails identifying symptoms ranging from classic claudication to rest pain and non-healing wounds. Unfortunately, only 10–30% of PAD patients report classic symptoms, and physician and patient awareness of PAD is less than 50%4,5. In the absence of unified screening guidelines, PAD remains markedly underdiagnosed. In a study of patients greater than 70 years of age, or greater than 50 years of age with a history of smoking or diabetes, 55% of PAD patients were undiagnosed5. As such, improved methods of PAD detection are needed for timelier risk factor modification and prevention of excess major adverse cardiac events, major limb events, and all-cause mortality6.

Previously reported PAD risk scores utilizing logistic regression are easily interpretable but have limited discrimination, with area under the curve (AUC) below 0.8 (refs. 7,8,9). This may be due to reliance on a limited number of demographic, laboratory, and comorbidity variables, which may not capture other contributors such as social or biologic factors that influence PAD risk and disease trajectory10. Electronic health record (EHR) data, in contrast, capture a depth and breadth of information that traditional risk factors do not always reflect, such as health care utilization (e.g. number of primary care and specialist visits), longitudinal results (e.g. lab values across time), and nuanced factors associated with disease risk (e.g. mental health). Given the large number of data points and non-linear associations, machine learning algorithms applied to EHR data could improve the performance of PAD risk detection models.

A further issue in developing risk models is clinical adoption. While clinical risk scores to enrich PAD detection exist, evidence of their routine use to inform PAD screening is lacking, and the ability to use risk scores may be limited by a lack of implementation considerations. Usability testing serves the dual purpose of bringing physician stakeholders into the process of designing risk tools and improving adoption through user-centered design11. “Think aloud” protocols, in which subjects are encouraged to verbalize their mental processes while performing a task, have been used extensively in designing clinical decision support interventions12,13. In particular, think aloud enables researchers to identify the features of an interface that draw users’ attention and how these features influence cognitive processing14.

Our hypotheses are that machine learning models for PAD classification using EHR data can improve rates of PAD detection and that usability testing can help inform the development of a risk prediction tool to increase physician usage and engagement. In this paper we evaluate the performance of traditional risk factor models versus machine learning models in the classification of PAD using both classical machine learning and deep learning approaches applied to electronic health record data. We integrate the best-performing model into an interactive dashboard for think aloud usability testing with primary care and cardiovascular specialty physicians in order to inform implementation efforts.

Methods

Data source

The Stanford Institutional Review Board approved this study. We received a waiver of informed consent because the research was considered minimal risk to participants, the research could not practicably be carried out without the waiver, no identifiable information was used, and the waiver did not adversely affect the rights and welfare of subjects. All methods were performed in accordance with the Helsinki Ethical Principles for Medical Research Involving Human Subjects. Data were derived from the STAnford Medicine Research Data Repository (STARR). Data include de-identified EHR clinical practice data from Stanford from 1998 to 2020 featuring over 4 million adult patients, \(>\) 75 million visits, \(>\) 65 million notes, \(>\) 67 million procedures, \(>\) 350 million labs and \(>\) 55 million prescriptions. These data were converted to the Observational Medical Outcomes Partnership common data model (OMOP CDM)15,16. As described elsewhere, in short, the OMOP CDM enables the use of standardized definitions for different data elements within the EHR17. This enables better reproducibility across care sites and portability of code to other institutions that represent their EHR data in the OMOP CDM.

Cohort

We aimed to develop models that identify cases of PAD prior to the diagnosis date, mimicking the scenario of identifying disease before clinician diagnosis. To do this, we defined PAD cases as patients with at least two separate ICD-9/ICD-10 or CPT codes and/or PAD mentions in their notes, and no exclusion codes (Supplemental Table 1). Only data collected up to 60 days prior to diagnosis were included, to ensure that codes associated with the PAD diagnosis itself were not used in our models. Controls were defined as those without any codes or text mentions of PAD in their health records. Patients were excluded if they had \(<1\) year of data. Age was calculated based on the patient’s age at the time of their last included visit. Our final models included all adult patients 50 years and older with at least 1 year of EHR data.
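The cohort logic above can be summarized in a short sketch. The snippet below is a minimal illustration on a toy event table with hypothetical column names; the actual phenotype uses the code lists in Supplemental Table 1.

```python
import pandas as pd

# Hypothetical long-format event table (one row per coded event).
events = pd.DataFrame({
    "person_id":         [1, 1, 1, 2, 3, 3],
    "event_date":        pd.to_datetime(["2015-01-01", "2016-03-01", "2016-04-01",
                                         "2014-06-01", "2015-09-01", "2015-10-01"]),
    "is_pad_code":       [True, True, True, False, True, False],
    "is_exclusion_code": [False, False, False, False, False, True],
})

# Cases: at least two separate PAD codes/mentions and no exclusion codes.
pad_counts = events[events["is_pad_code"]].groupby("person_id")["event_date"].nunique()
excluded = set(events.loc[events["is_exclusion_code"], "person_id"])
case_ids = set(pad_counts[pad_counts >= 2].index) - excluded

# Controls: no PAD codes at all; patients with a single PAD code are dropped.
control_ids = set(events["person_id"]) - set(pad_counts.index) - excluded

# Censor case data within 60 days of the first PAD code so that
# diagnosis-related codes never leak into the feature window.
index_date = (events[events["is_pad_code"]]
              .groupby("person_id")["event_date"].min().rename("index_date"))
events = events.join(index_date, on="person_id")
keep = (~events["person_id"].isin(case_ids)
        | (events["event_date"] < events["index_date"] - pd.Timedelta(days=60)))
feature_events = events[keep]
```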

Traditional risk score model

To evaluate the performance of traditional risk score models, we recapitulated the model developed by Duval and colleagues to estimate the risk of prevalent PAD using EHR data instead of registry data8. To do this we defined risk factors for PAD as outlined in their model, which included hypertension, hyperlipidemia, diabetes, coronary artery disease (CAD), cerebrovascular disease, congestive heart failure, and BMI. Patients had to have at least 2 affirmative codes/note mentions in their record to be classified as having a specific comorbidity. Because we did not have multiple blood pressure measurements for each patient, we modified the Duval hypertension definition to only include whether or not a patient was diagnosed with hypertension, without the degree of hypertension (e.g. Stage I or II). Our EHR-based definitions for the different risk factors are detailed in Supplemental Table 2. BMI was calculated as the patient’s average BMI in their health record after excluding outlier values. Race/ethnicity was derived from the EHR and coded as Caucasian, Asian, Black, or Hispanic. If these data were missing or not one of the aforementioned categories, the race/ethnicity variable was coded as “Other” to align with the Duval model categories. Observations with other missing data (e.g. BMI) were found to be infrequent and were dropped from the modeling process. Finally, to calculate individual risk scores for PAD we employed two approaches: calculating the nomogram score (and evaluating the overall C-statistic achieved by this score) and entering the risk factors into a logistic regression model trained on 75% of the data and tested on 25% of the data using fivefold inner and outer cross-validation. To mimic an expected 10% prevalence of PAD in a cohort of patients \(\ge\) 50 years of age, the training, testing, and each of the fivefold outer cross-validation sets had a 1:10 case/control ratio.
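As a concrete sketch of the nested cross-validation scheme for the logistic regression arm, consider the snippet below. The hyperparameter grid is an assumption (the paper does not report tuning details), and a synthetic dataset stands in for the Duval risk factor matrix at a 1:10 case/control ratio.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the risk factor matrix; ~10% of samples are cases.
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aucs = []
for train_idx, test_idx in outer.split(X, y):
    # Inner fivefold CV tunes the regularization strength on the training fold.
    inner = GridSearchCV(LogisticRegression(max_iter=1000),
                         param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
                         cv=5, scoring="roc_auc")
    inner.fit(X[train_idx], y[train_idx])
    probs = inner.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], probs))

print(f"Mean outer-fold AUC: {np.mean(aucs):.2f}")
```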

Machine learning model

We built machine learning models using EHR data formatted in the OMOP CDM. Specifically, we used the same cohort of PAD cases and controls, but instead of traditional risk factors we extracted all of an individual’s EHR data (ICD-9/10, CPT, labs, medications, and observation concept codes) from the date of entry into the health care system to 60 days prior to PAD diagnosis (for cases) or prior to the last visit date (for controls). This was done to mimic a use case in which a patient’s risk of PAD would be calculated prior to a definitive diagnosis. We maintained a sparse matrix and did not impute missing values in order to preserve a real-world representation of EHR data, and because there is no consensus on how best to model missing EHR data, where data may be missing completely at random or for reasons related to disease processes, which can be informative both clinically and in training machine learning models18. Using least absolute shrinkage and selection operator (LASSO) and random forest algorithms, we used 75% of patient data for model building and 25% for model testing. We performed fivefold inner and outer cross-validation. A 10% prevalence of PAD (1:10 case/control ratio) was designed into the training set, testing set, and each outer fold. The best model was chosen based on the AUC.
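A minimal sketch of this representation follows, with toy (patient, concept, count) triplets standing in for data extracted from the OMOP tables. Absent codes simply remain zero in the sparse matrix rather than being imputed, and both learners accept scipy sparse input directly.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Toy occurrence data: (patient row, OMOP concept column, count) triplets.
rows = np.array([0, 0, 1, 2, 2, 2])
cols = np.array([5, 17, 5, 3, 17, 42])
vals = np.array([2, 1, 4, 1, 1, 3])
X = csr_matrix((vals, (rows, cols)), shape=(3, 50))  # unobserved codes stay 0
y = np.array([1, 0, 1])  # PAD label per patient

# LASSO-penalized logistic regression and random forest, fit on sparse input.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
```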

Deep learning model

Disease progression is a dynamic phenomenon. Therefore, algorithms that can take a patient’s health care journey through time into account may produce more accurate estimates of disease risk. We developed a deep learning architecture based on a Recurrent Neural Network (RNN) using a module known as Long Short-Term Memory (LSTM)19. While classic RNN algorithms aim to capture time series data of arbitrary length (i.e. encompassing patient data over short or long periods of time), in practice, the longer the time-series horizon, the more likely these algorithms are to lose predictive power as they “forget” or over-emphasize the few model features that occurred much earlier in the time series. To overcome this problem, we used the LSTM variant to model time-series data. We built a deep learning model with two components: a fully connected neural network that modelled static demographic features (Fig. 1) and an LSTM component that modelled sequential EHR data.

Figure 1

Sequential clinical data and summarized demographic data are modelled in parallel, then combined to construct a final classification of PAD versus no PAD in the deep learning model.

To build our deep learning model we used the same EHR data used in our machine learning model described above and included time stamps up to PAD diagnosis or the last recorded visit (for controls). We also added a dimension known as “recency” that captured how close to PAD diagnosis or the end of the record (for controls) a diagnosis or lab value, for example, appeared within the record. The recency variable ranged from 1 to 10. We used a Keras framework for each of the layers in our architecture. As illustrated in Fig. 1, for the LSTM component, we first started with an embedding layer that took in data from each clinical visit (age, concept codes, etc.) and produced a new latent dimension. We then aggregated these data, which served as input to the LSTM model. In parallel, we used a fully connected neural network to weight age, race, gender, and total number of visits. We then aggregated these two components as inputs to a binary classification layer that classified the patient as having PAD or not.
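A sketch of this two-branch architecture in Keras is shown below. The paper does not report layer sizes, sequence lengths, or vocabulary size, so the dimensions are placeholder assumptions, and the sequence is simplified to one concept code per timestep.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative dimensions (assumptions, not the paper's tuned values).
MAX_EVENTS, VOCAB, EMB, N_DEMO = 100, 20000, 64, 4

# LSTM branch: sequences of visit-level concept codes -> embedding -> LSTM.
codes_in = keras.Input(shape=(MAX_EVENTS,), name="visit_codes")
x = layers.Embedding(VOCAB, EMB, mask_zero=True)(codes_in)
x = layers.LSTM(64)(x)

# Fully connected branch: static demographics (age, race, gender, # visits).
demo_in = keras.Input(shape=(N_DEMO,), name="demographics")
d = layers.Dense(16, activation="relu")(demo_in)

# Merge both branches into a binary PAD / no-PAD classification layer.
out = layers.Dense(1, activation="sigmoid")(layers.concatenate([x, d]))
model = keras.Model([codes_in, demo_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.AUC()])
```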

Statistical analysis

We used chi-square tests and Student’s t-tests to compare demographic and clinical factors in our patient cohort. We used the DeLong test of significance to compare AUCs20. Model calibration was evaluated by visual analysis of calibration curves. We used R software (version 3.6.3)21 and Python (version 3.7.10)22 for model development, evaluation and statistical analysis.
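The calibration-curve check can be reproduced along the following lines with scikit-learn and matplotlib; here synthetic, perfectly calibrated probabilities stand in for a model's held-out predictions.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.calibration import calibration_curve

# Synthetic stand-ins for held-out labels and predicted probabilities.
rng = np.random.default_rng(0)
y_prob = rng.uniform(size=2000)
y_true = rng.uniform(size=2000) < y_prob  # calibrated by construction

# Bin predictions into deciles and compare to the observed event fraction.
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
plt.plot(mean_pred, frac_pos, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed fraction of PAD cases")
plt.legend()
plt.show()
```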

Usability testing

Best-performing model output, mock patient demographics, selected clinical features, and current guidelines on treatment of patients with PAD were inserted into two clinician-facing dashboards. The “Tabbed Dashboard” stored patient demographics, visit summaries, and risk factors behind nested links, while the “Unified Dashboard” displayed this information on one page. A prediction score was generated as a percentage probability of the patient having PAD based on normalized data. The dashboard recommended screening for values greater than 50%.

Clinicians specializing in primary care, cardiology, or vascular medicine were recruited via email, and qualitative usability testing was performed with 25-min semi-structured interviews. Participants were asked to describe their approach to diagnosing PAD prior to listening to a patient vignette and navigating the dashboards in randomized order. A facilitator encouraged participants to think aloud using prompts previously described by Virzi et al.23. Participants were recruited until thematic saturation, and transcripts were analyzed thematically.

Results

Cohort characteristics

We identified 3,168 patients with PAD and 16,863 controls, all aged 50 years and older. Table 1 details comparisons across PAD cases and controls. Amongst PAD cases, 60% were male, while 45% of controls were male. Approximately 70% of our entire cohort were Caucasian. As expected, those with PAD had a higher burden of comorbidities, with 44% of PAD cases having a history of cerebrovascular disease (CVD) and 72% having coronary artery disease (CAD). Heart failure (HF), hypertension (HTN), diabetes and hyperlipidemia (HLD) also occurred more frequently amongst PAD cases compared to controls.

Table 1 Descriptive data of case and control cohorts.

Results of traditional risk score model

Three variables had missingness: race (1.8% of cohort), body mass index (BMI, 0.06%), and sex (0.03%). Applying two different approaches to our traditional risk factor modeling, we calculated performance using a logistic regression model based on the factors outlined by Duval and colleagues8 in addition to calculating the nomogram score as recommended by the authors. The nomogram score, unlike the logistic regression model, can be hand-calculated by physicians. The logistic regression model achieved an average AUC of 0.81 (Table 2), with the highest AUC achieved being 0.83 (Fig. 2A,B, Supplemental Table 3). The nomogram model achieved an average AUC of 0.64, with the highest AUC being 0.66 (Supplemental Table 4). Overall, our calibration curves demonstrated that the logistic regression model tended to overestimate risk across low and high-risk individuals.

Table 2 Model results comparison.
Figure 2

Logistic regression (a) receiver operating characteristic curve and (b) calibration curves for five outer validation folds. AUC—area under the curve.

Results of machine learning model

Our best performing machine learning model used a random forest (RF) algorithm and achieved an average AUC of 0.91 (Table 2, Supplemental Table 5). Compared to the logistic regression model, the RF model had a significantly higher AUC (P < 0.0001) and similar calibration characteristics (Fig. 3a,b). Feature importance was calculated as the mean across all five outer validation folds, and the features most heavily weighted in the RF model are illustrated in Fig. 4. As expected, model features were enriched for co-morbid cardiac and aortic diseases.
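Averaging importances across folds can be read directly off the fitted estimators; a minimal sketch follows, with five toy forests standing in for the outer-fold models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in: five outer-fold random forests fitted on synthetic data.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
rf_models = [RandomForestClassifier(random_state=k).fit(X, y) for k in range(5)]

# Mean importance across folds, then the top-ranked features (as in Fig. 4).
mean_imp = np.mean([m.feature_importances_ for m in rf_models], axis=0)
for i in np.argsort(mean_imp)[::-1][:5]:
    print(f"feature {i}: {mean_imp[i]:.4f}")
```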

Figure 3

Random forest (a) area under the curve and (b) calibration curves for five outer validation folds. AUC—area under the curve.

Figure 4

Features most heavily weighted in discriminating between cases and controls in the random forest model, based on feature importance weights averaged across folds.

Results of deep learning model

Our deep learning model achieved the best AUC results (Table 2, Fig. 5a,b, Supplemental Table 6), with significant improvements in AUC compared to the logistic regression and random forest models (P < 0.0001). Calibration curves demonstrate more variability across folds, but overall better calibration across high and low risk groups compared to the random forest and logistic regression models.

Figure 5

Deep learning model (a) area under the curve and (b) calibration curves for five outer validation folds. AUC—area under the curve.

In comparing the results of each model, we identified how many additional true positive cases were found with increasing model sophistication (Table 3). Relative to the logistic regression model, the random forest and deep learning models produced nearly 10% and nearly 25% increases in identification of true positive cases, respectively. The deep learning model improved identification of true positive cases by approximately 14% compared to the random forest model.

Table 3 Percentage increase in true positive cases with increasing model sophistication.

Usability testing results

The dashboard designs we developed for usability testing are illustrated in Fig. 6a,b. Twelve clinicians (6 primary care physicians and 6 cardiovascular specialists) underwent usability testing using these dashboards. From the interviews, three themes emerged: ease of understanding, ease of use, and acceptability (Table 4).

Figure 6

Dashboards for presentation of patient risk of peripheral artery disease. (a) Tabbed Dashboard. Further information on risk factors, patient summary, demographics and guideline recommendations is available only through directly clicking labeled links. (b) Unified Dashboard. All patient information is displayed in one point of reference, with guideline recommendations made available through clicking a link. AI—artificial intelligence; NLP—natural language processing; PAD—peripheral artery disease.

Table 4 Usability themes and subthemes.

Ease of understanding

Half of the providers indicated that the value displayed by the prediction model was difficult to interpret. Some inquired about the numerical threshold at which the model recommended screening (25%), while others wondered whether the value represented a positive predictive value or another measure of certainty (33%).

Ease of use

Physicians unanimously preferred the Unified Dashboard (100%) (Fig. 6b), with the majority emphasizing the importance of decreasing the number of clicks required to visualize information (67%). Since the order in which the dashboards were presented to participants was randomized, the preference for the Unified Dashboard was not influenced by ordering bias. Most physicians preferred integration into the health record, specifically recommending a link within the EHR system or direct import into their notes (58%). Some stated that in the context of a busy clinic, it would be difficult to integrate an external website into their workflow.

Acceptability

The majority of practitioners generally felt that the dashboard could improve their ability to detect PAD (75%), particularly when dealing with complex patients or diagnostic uncertainty. However, acceptability among primary care providers was influenced by perceptions that a missed diagnosis of PAD was less critical compared to other screening initiatives (67% of primary care physicians). Participants unanimously reported that they had never implemented a machine learning-based tool into their clinical workflow (100%). While the majority of participants had positive perceptions of machine learning (83%), two dissenting opinions highlighted skepticism due to a lack of algorithmic transparency. One physician cited concerns about the lack of clarity in how patient factors influenced the model, suggesting that adding information regarding decision thresholds might address this. Another participant stated that their own unfamiliarity with machine learning was a personal barrier to acceptance, although relevant publication in a peer-reviewed journal would be beneficial.

Discussion

In this work we demonstrate the feasibility of using machine learning and deep learning algorithms to detect patients at elevated risk of having PAD prior to diagnosis using EHR data. We found that deep learning models that include dates of diagnoses, procedures, medications and other EHR information performed significantly better than a traditional risk factor model and standard machine learning approaches. Additionally, we found that “Think aloud” stakeholder interviews enabled greater insight into developing an implementation strategy that might be more appealing to busy clinicians. Clinician stakeholders evaluating a model dashboard felt model implementation could improve diagnosis for complex patients or those with moderate pre-test probability of PAD, and favored EHR integration and click reduction to facilitate adoption.

We have previously shown that machine learning-based models can identify potentially undiagnosed PAD patients using clinical study data24,25,26,27. In work by Ross and colleagues, data from the Genetics of Peripheral Artery Disease (GenePAD) study were used to build traditional and machine learning models. By applying a systematic comparison, Ross et al. showed that machine learning-based models outperformed logistic regression models for identification of PAD patients and for calculating risk of mortality24. In our current work, we extend those findings and show that it is possible to develop accurate machine learning models from EHR data. This is an important advance as EHR data can often be missing, sparse, and unstructured, and accordingly, classic linear models are less likely to perform well.

Others have described different methodologies for identifying PAD using EHR data. For example, Afzal and colleagues applied natural language processing (NLP) to automate detection of prevalent PAD in the EHR28. In their work, Afzal et al. extracted several key concepts describing PAD and used them to develop rules for classifying patients as having or not having PAD. In comparison with their previously suggested methods that used ICD-9 diagnostic codes, and also a combination of ICD-9 codes with procedural codes, to identify PAD patients29, they demonstrated that NLP methods can achieve good accuracy (NLP: 91.8%, full model: 81.8%, simple model: 83%). While novel, Afzal and colleagues’ work focused on identifying already diagnosed PAD cases, while our models aimed to identify PAD prior to clinical diagnosis. Moreover, EHR data are notorious for being highly unstructured. Developing a comprehensive set of rules to capture all variations and combinations of concepts describing a clinical entity can therefore be a cumbersome task, so algorithms that can automate extraction and use of relevant data are important.

Our deep learning model outperformed both our traditional and machine learning models. We believe this is the case for a few reasons. While traditional risk factor and machine learning approaches tend to use aggregate patient data to make predictions, deep learning algorithms such as recurrent neural networks can take the timing of feature occurrences into account when making predictions. Additionally, certain features such as medications, diagnoses and procedures that occur at certain time points in a patient’s history may be especially predictive of an outcome, and these relationships can be modelled in deep learning architectures. Furthermore, by utilizing an added modeling layer known as Long Short-Term Memory (LSTM) in our deep learning architecture we were able to model data over longer time horizons30. Lastly, deep learning models, through the complexity enabled by multiple neural network layers, enable modeling of more complex non-linear relationships than traditional machine learning algorithms. Though it is sometimes argued that deep learning models may be too complex and not practical for point of care usage31, we found that it took an average of 1.25 h to train our deep learning model and 2.6 ms to make a prediction for a single patient. Thus, models can potentially be retrained on a weekly or monthly basis to ensure they reflect up-to-date data, while point of care predictions can be made even during a patient’s clinic visit, when clinicians may have anywhere from 15 to 30 min to see a patient.

While model performance is an important aspect of disease risk prognostication, adapting interventions to stakeholder needs is critical to adoption. In usability testing, providers felt implementation would benefit complex patients and those with moderate risk of PAD based on traditional risk factors. These patients typically require extensive chart review and counseling. Thus, accurate models that can process complex patient records and provide a summary risk score can be of high utility in these use cases. Furthermore, the cognitive load presented by such complex patients highlights the need for EHR integration and low-interaction interfaces to reduce additional cognitive burden and navigation time32. In addition, clarifying the prediction score’s meaning and thresholds can facilitate usage by allowing providers to compare model output with their own internal schema. The next stage of implementation thus includes recruiting hospital information technology support to optimize EHR integration and exploring ways to clarify the prediction score. To this end, Norvell and colleagues report ways to present clarifying data in a clinical context. In a usability study of an amputation prediction tool, Norvell et al. used hover features, where clarifying text appears only when the pointer hovers over an area. Such an approach can significantly reduce interface clutter33 and provide important model details at the point of care, potentially increasing the likelihood of adoption.

We expected to find a large amount of skepticism towards the use of machine learning-based models for risk assessment, based on research that has previously reported barriers to acceptability of similar approaches due to lack of model transparency and actionability34. However, the majority of practitioners we surveyed were receptive to the technology. For the minority of participants who cited issues with machine learning models, proof of peer review and increased clarity about decision-making thresholds were named as remedies. Stakeholders also identified opportunities to increase actionability by identifying patients in whom model implementation could change management. Even so, while physicians included in our study were generally receptive to utilizing the model dashboard, parallel measures such as educational initiatives and identification of provider champions will be needed to encourage real-world use.

Despite our promising results, a downside of employing machine learning, and especially deep learning methodologies, is that a lack of ample data can result in severe overfitting35. In the context of disease prediction, however, the widespread adoption of EHRs has made a large amount of patient data available36. To enable further widespread use, our models will need to be validated prospectively, and ideally evaluated in multiple settings to get a better sense of real-world performance. Another area of growing concern is whether machine learning models can ultimately be fair, equitable, and impactful37. There is potential for EHR models to recapitulate detrimental biases in the healthcare system, such that those from underserved health groups may continue to be disadvantaged by machine learning approaches. This is especially true for deep learning models, where it remains difficult to identify which features the model may have weighted more heavily in making predictions. For example, while risk factors in our traditional risk models were pre-defined, and we can extract from random forest-based models which features were most important to model predictions, we were not able to extract feature weights from our deep learning models. Future work will require assessment of how machine learning models perform in different groups prior to deployment, and development of approaches for interpretable feature extraction from deep learning models to enable better interpretation of model outputs. Another limitation of our work is that we built our models through supervised learning, which requires the laborious task of data labeling. While we used the latest published techniques for PAD phenotyping38,39, given the nature of EHR data and the relatively low rates of PAD diagnosis, some patients may have been mislabeled. An alternative approach is to develop data sets in an unsupervised manner40. For example, many have proposed using machine learning and deep learning techniques to develop training data, which would decrease reliance on manual labeling algorithms. Research is ongoing into evaluating how such methods fare in comparison to manually labeled data. Lastly, limitations of our usability testing include the artificial setting of the study and the observer effect. Physicians interacted with the dashboard via remote videoconferencing, which enabled recording but may yield different results from a live clinical encounter. Similarly, the presence of the observer may impact responses, although most of the observer speech was scripted in order to help standardize interactions.

In conclusion, we demonstrate the feasibility and performance of using machine learning and deep learning techniques coupled with EHR data to identify cases of PAD prior to diagnosis. We further report the key components of implementation that need to be considered prior to model deployment. Future research will focus on prospective validation of our models and optimization of the dashboard design for clinical use.