Reinforcement learning model for optimizing dexmedetomidine dosing to prevent delirium in critically ill patients
Delirium can result in undesirable outcomes, including increased length of stay and mortality, in patients admitted to the intensive care unit (ICU). Dexmedetomidine has emerged as an option for delirium prevention in these patients; however, determining the optimal dose is challenging. A reinforcement learning-based Artificial Intelligence model for Delirium prevention (AID) is proposed to optimize dexmedetomidine dosing. The model was developed and internally validated using 2416 patients (2531 ICU admissions) and externally validated on 270 patients (274 ICU admissions). The estimated return of the AID policy was higher than that of the clinicians' policy in both the derivation (0.390, 95% confidence interval [CI] 0.361 to 0.420 vs. -0.051, 95% CI -0.077 to -0.025) and external validation (0.186, 95% CI 0.139 to 0.236 vs. -0.436, 95% CI -0.474 to -0.402) cohorts. Our findings indicate that AID might support clinicians' decision-making regarding dexmedetomidine dosing to prevent delirium in ICU patients, but further off-policy evaluation is required.
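The abstract compares the estimated return of the learned AID policy against the clinicians' behaviour policy, which is the province of off-policy evaluation. A minimal sketch of one common off-policy estimator, weighted importance sampling (WIS), is shown below; the trajectory format, policy interfaces, and discount factor are illustrative assumptions, not the AID implementation.

```python
# Minimal sketch of weighted importance sampling (WIS) off-policy evaluation,
# assuming discrete dose actions and stochastic policies (hypothetical names).
import numpy as np

def wis_return(trajectories, pi_eval, pi_behav, gamma=0.99):
    """trajectories: list of [(state, action, reward), ...], one per ICU stay.
    pi_eval / pi_behav: callables returning P(action | state)."""
    ratios, returns = [], []
    for traj in trajectories:
        rho, ret, discount = 1.0, 0.0, 1.0
        for state, action, reward in traj:
            # Cumulative importance ratio of evaluation vs. behaviour policy.
            rho *= pi_eval(state, action) / max(pi_behav(state, action), 1e-8)
            ret += discount * reward
            discount *= gamma
        ratios.append(rho)
        returns.append(ret)
    ratios = np.asarray(ratios)
    # Self-normalised estimate of the evaluation policy's expected return.
    return float(np.sum(ratios * np.asarray(returns)) / np.sum(ratios))
```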
Cost-effectiveness analysis of mHealth applications for depression in Germany using a Markov cohort simulation
Regulated mobile health applications are called digital health applications ("DiGA") in Germany. To qualify for reimbursement by statutory health insurance companies, DiGA have to prove positive care effects in scientific studies. Since the cost-effectiveness of DiGA remains largely unexplored empirically, this study pioneers the use of cohort-based state-transition Markov models to evaluate DiGA for depression. As health states, we define mild, moderate, and severe depression, remission, and death. Comparing a future scenario in which 50% of patients receive supplementary DiGA access with the current standard of care reveals a gain of 0.02 quality-adjusted life years (QALYs) per patient, which comes at additional direct costs of ~1536 EUR per patient over a five-year timeframe. The factors determining DiGA cost-effectiveness are the DiGA cost structure and individual DiGA effectiveness. Under Germany's existing cost structure, DiGA for depression have yet to demonstrate the ability to generate overall savings in healthcare expenditures.
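As a cohort-based state-transition model, the analysis tracks how the depression cohort redistributes across health states over time and accrues QALYs from state-specific utility weights. The sketch below illustrates the mechanics with purely hypothetical transition probabilities and utilities, not the study's calibrated parameters.

```python
# Minimal sketch of a cohort-based state-transition Markov model with
# illustrative (not study-calibrated) annual transition probabilities.
import numpy as np

states = ["mild", "moderate", "severe", "remission", "death"]
P = np.array([                       # rows sum to 1 (hypothetical values)
    [0.50, 0.20, 0.05, 0.24, 0.01],
    [0.20, 0.45, 0.15, 0.18, 0.02],
    [0.05, 0.25, 0.50, 0.15, 0.05],
    [0.10, 0.05, 0.02, 0.82, 0.01],
    [0.00, 0.00, 0.00, 0.00, 1.00],
])
utilities = np.array([0.70, 0.55, 0.40, 0.85, 0.00])  # QALY weight per state

cohort = np.array([0.4, 0.4, 0.2, 0.0, 0.0])  # initial state distribution
total_qalys = 0.0
for year in range(5):                 # five-year horizon, no discounting
    total_qalys += float(cohort @ utilities)
    cohort = cohort @ P               # redistribute the cohort each cycle
print(f"QALYs per patient over 5 years: {total_qalys:.2f}")
```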
The utility of personal wearable data in long COVID and personalized patient care
Radin et al.’s recent study on patients with long COVID demonstrates that personal wearable data can provide critical insight into complex conditions. This editorial argues that such research insights support the integration of personal wearables into healthcare. Addressing the challenges of incorporating wearable data in the clinic will require, as first steps, AI-assisted data sorting, data sharing, device interoperability, FDA oversight, and expanded insurance coverage.
Developing a Canadian artificial intelligence medical curriculum using a Delphi study
The integration of artificial intelligence (AI) education into medical curricula is critical for preparing future healthcare professionals. This research employed the Delphi method to establish an expert-based AI curriculum for Canadian undergraduate medical students. A panel of 18 experts in health and AI across Canada participated in three rounds of surveys to determine essential AI learning competencies. The study identified key curricular components across ethics, law, theory, application, communication, collaboration, and quality improvement. The findings demonstrate substantial support among medical educators and professionals for the inclusion of comprehensive AI education, with 82 of 107 curricular competencies deemed essential to address both clinical and educational priorities. The study additionally suggests methods for integrating these competencies into already dense medical curricula. The endorsed set of objectives aims to enhance AI literacy and application skills among medical students, equipping them to use AI technologies effectively in future healthcare settings.
Post-marketing surveillance of anticancer drugs using natural language processing of electronic medical records
This study demonstrates that adverse events (AEs) extracted from clinical texts using natural language processing (NLP) reflect the known frequencies of AEs associated with anticancer drugs. Using data from 44,502 cancer patients at a single hospital, we identified cases prescribed anticancer drugs (platinum, PLT; taxane, TAX; pyrimidine, PYA) and compared them with a non-treatment (NTx) group using propensity score matching. Over 365 days, AEs (peripheral neuropathy, PN; oral mucositis, OM; taste abnormality, TA; appetite loss, AL) were extracted from clinical text using an NLP tool. The hazard ratios (HRs) for the anticancer drugs were: PN, 1.15-1.95; OM, 3.11-3.85; TA, 3.48-4.71; and AL, 1.98-3.84; all significantly higher than in the NTx group. Sensitivity analysis revealed that the HR for TA may have been underestimated; however, the remaining three types of AEs extracted from clinical text by NLP were consistently associated with the three anticancer drugs.
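The analysis pattern described, propensity score matching of treated versus non-treatment patients followed by hazard ratio estimation for an NLP-extracted AE, can be sketched as below. Column names, covariates, and the 1:1 nearest-neighbour matching are illustrative assumptions rather than the study's exact protocol.

```python
# Minimal sketch: propensity score matching followed by a Cox model for an
# NLP-extracted adverse event (hypothetical column names and covariates).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from lifelines import CoxPHFitter

def matched_hazard_ratio(df, covariates):
    # Propensity of receiving the anticancer drug given baseline covariates.
    model = LogisticRegression(max_iter=1000).fit(df[covariates], df["treated"])
    df = df.assign(ps=model.predict_proba(df[covariates])[:, 1])
    treated, control = df[df.treated == 1], df[df.treated == 0]
    # 1:1 nearest-neighbour matching on the propensity score.
    nn = NearestNeighbors(n_neighbors=1).fit(control[["ps"]])
    _, idx = nn.kneighbors(treated[["ps"]])
    matched = pd.concat([treated, control.iloc[idx.ravel()]])
    # Cox proportional hazards model on the matched sample.
    cph = CoxPHFitter().fit(
        matched[["treated", "days_to_event", "event_observed"]],
        duration_col="days_to_event", event_col="event_observed",
    )
    return cph.hazard_ratios_["treated"]
```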
A data-driven framework for identifying patient subgroups on which an AI/machine learning model may underperform
A fundamental goal of evaluating the performance of a clinical model is to ensure it performs well across a diverse intended patient population. A primary challenge is that the data used in model development and testing often consist of many overlapping, heterogeneous patient subgroups that may not be explicitly defined or labeled. While a model's average performance on a dataset may be high, the model can have significantly lower performance for certain subgroups, which may be hard to detect. We describe an algorithmic framework for identifying subgroups with potential performance disparities (AFISP), which produces a set of interpretable phenotypes corresponding to subgroups for which the model's performance may be relatively lower. This could allow model evaluators, including developers and users, to identify possible failure modes prior to wide-scale deployment. We illustrate the application of AFISP by applying it to a patient deterioration model to detect significant subgroup performance disparities, and show that AFISP is significantly more scalable than existing algorithmic approaches.
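AFISP itself is not specified in the abstract, but the underlying evaluation question, whether a model's metric drops for particular subgroups relative to its overall performance, can be illustrated with a simple slicing check. The sketch below uses hypothetical column names and a fixed AUC gap threshold; it is not the AFISP algorithm.

```python
# Minimal sketch (not the AFISP algorithm): flag candidate subgroups whose
# AUC falls well below the overall AUC. Column and subgroup names hypothetical.
from sklearn.metrics import roc_auc_score

def flag_underperforming_subgroups(df, subgroup_masks, gap=0.05):
    """df: DataFrame with binary 'label' and continuous 'model_score' columns.
    subgroup_masks: dict mapping phenotype name -> boolean row mask."""
    overall = roc_auc_score(df["label"], df["model_score"])
    flagged = {}
    for name, mask in subgroup_masks.items():
        sub = df[mask]
        if sub["label"].nunique() < 2:     # AUC undefined with a single class
            continue
        auc = roc_auc_score(sub["label"], sub["model_score"])
        if auc < overall - gap:
            flagged[name] = {"subgroup_auc": auc, "overall_auc": overall}
    return flagged
```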
Interpretable machine learning model for digital lung cancer prescreening in Chinese populations with missing data
We developed an interpretable model, BOUND (Bayesian netwOrk for large-scale lUng caNcer Digital prescreening), using a comprehensive EHR dataset from China to improve lung cancer detection rates. BOUND employs Bayesian network uncertainty inference, allowing it to predict lung cancer risk even with missing data and to identify high-risk factors. Developed using data from 905,194 individuals, BOUND achieved an AUC of 0.866 in internal validation, with time- and geography-based external validations yielding AUCs of 0.848 and 0.841, respectively. In datasets with 10%-70% missing data, AUCs ranged from 0.827 to 0.746. The model demonstrates strong calibration, clinical utility, and robust performance in both balanced and imbalanced datasets. A risk scorecard was also created, improving detection rates by up to 6.8 times; it is freely available online ( https://drzhang1.aiself.net/ ). BOUND enables non-radiative, cost-effective lung cancer prescreening, excels with missing data, and addresses treatment inequities in resource-limited primary healthcare settings.
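The reason Bayesian network inference tolerates missing inputs is that unobserved variables can be marginalised out of the joint distribution. The sketch below illustrates this with a naive Bayes factorisation, a special case of a Bayesian network, using entirely hypothetical features and conditional probabilities rather than anything from BOUND.

```python
# Minimal sketch: inference under missing data via marginalisation, shown for
# a naive Bayes factorisation with hypothetical priors and CPT values.
prior = {"cancer": 0.01, "no_cancer": 0.99}
cpt = {  # P(feature = 1 | class); features are illustrative only
    "smoker":         {"cancer": 0.60, "no_cancer": 0.25},
    "chronic_cough":  {"cancer": 0.40, "no_cancer": 0.10},
    "family_history": {"cancer": 0.20, "no_cancer": 0.05},
}

def posterior_cancer(evidence):
    """evidence: dict of observed features only; missing features are skipped,
    which is equivalent to summing them out of the joint distribution."""
    scores = {}
    for c, p in prior.items():
        likelihood = p
        for feat, value in evidence.items():
            p1 = cpt[feat][c]
            likelihood *= p1 if value == 1 else (1.0 - p1)
        scores[c] = likelihood
    return scores["cancer"] / sum(scores.values())

print(posterior_cancer({"smoker": 1}))                      # two features missing
print(posterior_cancer({"smoker": 1, "chronic_cough": 1}))  # more evidence observed
```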
Simulated misuse of large language models and clinical credit systems
In the future, large language models (LLMs) may enhance the delivery of healthcare, but there are risks of misuse. These methods may be trained to allocate resources via unjust criteria involving multimodal data - financial transactions, internet activity, social behaviors, and healthcare information. This study shows that LLMs may be biased in favor of collective/systemic benefit over the protection of individual rights and could facilitate AI-driven social credit systems.
The quality and safety of using generative AI to produce patient-centred discharge instructions
Patient-centred discharge instructions can improve adherence and outcomes. Using GPT-3.5 to generate patient-centred discharge instructions, we evaluated responses for safety, accuracy and language simplification. When tested on 100 discharge summaries from MIMIC-IV, potentially harmful safety issues attributable to the AI tool were found in 18% of cases, including 6% containing hallucinations and 3% introducing new medications. AI tools can generate patient-centred discharge instructions, but careful implementation is needed to avoid harms.
An iterative approach for estimating domain-specific cognitive abilities from large scale online cognitive data
Online cognitive tasks are gaining traction as scalable and cost-effective alternatives to traditional supervised assessments. However, variability in people's home devices, visual and motor abilities, and speed-accuracy biases confounds the specificity with which online tasks can measure cognitive abilities. To address these limitations, we developed IDoCT (Iterative Decomposition of Cognitive Tasks), a method for estimating domain-specific cognitive abilities and trial-difficulty scales from task performance timecourses in a data-driven manner while accounting for device and visuomotor latencies, unspecific cognitive processes and speed-accuracy trade-offs. IDoCT can operate with any computerised task in which cognitive difficulty varies across trials. Using data from 388,757 adults, we show that IDoCT successfully dissociates cognitive abilities from these confounding factors. The resulting cognitive scores exhibit stronger dissociation of psychometric factors, improved cross-participant distributions, and meaningful demographic associations. We propose that IDoCT can enhance the precision of online cognitive assessments, especially in large-scale clinical and research applications.
Learning from the EHR to implement AI in healthcare
The introduction of the electronic health record was heralded as a technology solution to improve care quality and efficiency, but these tools have contributed to increased administrative burden and burnout for clinicians. Today, artificial intelligence is receiving much of the same attention and promises as electronic health records. Can healthcare learn from the failures of electronic health records to maximize the potential of artificial intelligence?
Phenotype driven molecular genetic test recommendation for diagnosing pediatric rare disorders
Patients with rare diseases often experience prolonged diagnostic delays. Ordering appropriate genetic tests is crucial yet challenging, especially for general pediatricians without genetic expertise. Recent American College of Medical Genetics (ACMG) guidelines embrace early use of exome sequencing (ES) or genome sequencing (GS) for conditions such as congenital anomalies or developmental delays, while still recommending gene panels for patients exhibiting strong manifestations of a specific disease. Recognizing the difficulty in navigating these options, we developed a machine learning model trained on 1005 patient records from Columbia University Irving Medical Center to recommend appropriate genetic tests based on phenotype information. The model achieved strong performance, with an AUROC of 0.823 and an AUPRC of 0.918, aligning closely with decisions made by genetic specialists, and demonstrated strong generalizability (AUROC 0.77, AUPRC 0.816) in an external cohort, indicating its potential value for general pediatricians seeking to expedite rare disease diagnosis by improving genetic test ordering.
Publisher Correction: Knowledge abstraction and filtering based federated learning over heterogeneous data views in healthcare
The effects of a digital health intervention on patient activation in chronic kidney disease
My Kidneys & Me (MK&M), a digital health intervention delivering specialist health and lifestyle education for people with CKD, was developed and its effects tested (SMILE-K trial, ISRCTN18314195, 18/12/2020). 420 adult patients with CKD stages 3-4 were recruited and randomised 2:1 to intervention (MK&M, n = 280) or control (n = 140) groups. Outcomes, including the Patient Activation Measure (PAM-13), were collected at baseline and 20 weeks. Complete case (CC) and per-protocol (PP) analyses were conducted. 210 (75%) participants used MK&M more than once. PAM-13 scores increased at 20 weeks compared with control (CC: +3.1 (95%CI: -0.2 to 6.4), P = 0.065; PP: +3.6 (95%CI: 0.2 to 7.0), P = 0.041). In those with low activation at baseline, significant between-group differences were observed (CC: +6.6 (95%CI: 1.3 to 11.9), P = 0.016; PP: +9.2 (95%CI: 4.0 to 14.6), P < 0.001), favouring the MK&M group. MK&M improved patient activation in those who used the resource compared with standard care, although the overall effect was non-significant. The greatest benefits were seen in those with low activation.
A strategy for cost-effective large language model use at health system-scale
Large language models (LLMs) can optimize clinical workflows; however, the economic and computational challenges of their utilization at the health system scale are underexplored. We evaluated how concatenating queries with multiple clinical notes and tasks simultaneously affects model performance under increasing computational loads. We assessed ten LLMs of different capacities and sizes utilizing real-world patient data. We conducted >300,000 experiments of various task sizes and configurations, measuring accuracy in question-answering and the ability to properly format outputs. Performance deteriorated as the number of questions and notes increased. High-capacity models, like Llama-3-70b, had low failure rates and high accuracies. GPT-4-turbo-128k was similarly resilient across task burdens, but performance deteriorated after 50 tasks at large prompt sizes. After addressing mitigable failures, these two models can concatenate up to 50 simultaneous tasks effectively, with validation on a public medical question-answering dataset. An economic analysis demonstrated up to a 17-fold cost reduction at 50 tasks using concatenation. These results identify the limits of LLMs for effective utilization and highlight avenues for cost-efficiency at the enterprise scale.
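The cost advantage of concatenation follows from amortising the fixed prompt (instructions plus clinical notes) across many tasks in a single call. The sketch below makes that arithmetic explicit with assumed token counts and per-token prices; all numbers are illustrative and are not the study's cost figures.

```python
# Minimal sketch of the economics of task concatenation: a shared prompt
# prefix is paid once per API call, so batching k tasks amortises that cost.
# Token counts and prices are assumptions, not the study's figures.
def cost_per_task(tasks_per_call, shared_prompt_tokens=6000,
                  tokens_per_task=150, output_tokens_per_task=80,
                  input_price=0.01 / 1000, output_price=0.03 / 1000):
    input_tokens = shared_prompt_tokens + tasks_per_call * tokens_per_task
    output_tokens = tasks_per_call * output_tokens_per_task
    total = input_tokens * input_price + output_tokens * output_price
    return total / tasks_per_call

print(f"1 task/call  : ${cost_per_task(1):.4f} per task")
print(f"50 tasks/call: ${cost_per_task(50):.4f} per task")
print(f"reduction    : {cost_per_task(1) / cost_per_task(50):.1f}x")
```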
Multisource representation learning for pediatric knowledge extraction from electronic health records
Electronic Health Record (EHR) systems are particularly valuable in pediatrics due to high barriers in clinical studies, but pediatric EHR data often suffer from low content density. Existing EHR code embeddings tailored for the general patient population fail to address the unique needs of pediatric patients. To bridge this gap, we introduce a transfer learning approach, MUltisource Graph Synthesis (MUGS), aimed at accurate knowledge extraction and relation detection in pediatric contexts. MUGS integrates graphical data from both pediatric and general EHR systems, along with hierarchical medical ontologies, to create embeddings that adaptively capture both the homogeneity and heterogeneity between hospital systems. These embeddings enable refined EHR feature engineering and nuanced patient profiling, proving particularly effective in identifying pediatric patients similar to specific profiles, with a focus on pulmonary hypertension (PH). MUGS embeddings, resistant to negative transfer, outperform other benchmark methods in multiple applications, advancing evidence-based pediatric research.
Simulating A/B testing versus SMART designs for LLM-driven patient engagement to close preventive care gaps
Population health initiatives often rely on cold outreach to close gaps in preventive care, such as overdue screenings or immunizations. Tailoring messages to diverse patient populations remains challenging, as traditional A/B testing requires large sample sizes yet tests only two alternative messages. With the increasing availability of large language models (LLMs), programs can use tiered outreach combining LLM and human agents, raising the question of which patients need what level of human support to engage large populations cost-effectively. Using microsimulations, we compared the statistical power and false positive rates of A/B testing and Sequential Multiple Assignment Randomized Trials (SMART) for developing personalized communications across multiple effect sizes and sample sizes. SMART showed better cost-effectiveness and net benefit across all scenarios, but superior power for detecting heterogeneous treatment effects (HTEs) only in later randomization stages, when populations were more homogeneous and engagement was driven by subtler differences.
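The microsimulation approach amounts to repeatedly simulating an outreach experiment under a chosen effect size and counting how often the analysis detects it (power) or flags a null effect (false positive rate). The sketch below shows this for a simple two-arm A/B comparison with hypothetical engagement rates; a SMART analysis layers sequential re-randomisation on top of the same idea.

```python
# Minimal sketch of a power/false-positive microsimulation for a two-arm
# messaging experiment with hypothetical engagement rates.
import numpy as np
from scipy.stats import chi2_contingency

def simulated_detection_rate(n_per_arm, p_a, p_b, n_sims=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        a = rng.binomial(n_per_arm, p_a)      # engagements in arm A
        b = rng.binomial(n_per_arm, p_b)      # engagements in arm B
        table = [[a, n_per_arm - a], [b, n_per_arm - b]]
        _, p, _, _ = chi2_contingency(table)
        hits += p < alpha
    return hits / n_sims

print(simulated_detection_rate(500, 0.10, 0.13))  # power for a 3-point lift
print(simulated_detection_rate(500, 0.10, 0.10))  # false positive rate (null)
```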
Accuracy and efficiency of drilling trajectories with augmented reality versus conventional navigation randomized crossover trial
Conventional navigation systems (CNS) in surgery require strong spatial cognitive abilities and hand-eye coordination. Augmented Reality Navigation Systems (ARNS) provide 3D guidance and may overcome these challenges, but their accuracy and efficiency compared to CNS have not been systematically evaluated. In this randomized crossover study with 36 participants from different professional backgrounds (surgeons, students, engineers), drilling accuracy, time and perceived workload were evaluated using ARNS and CNS. For the first time, this study provides compelling evidence that ARNS and CNS have comparable accuracy in translational error. Differences in angle and depth error with ARNS were likely due to limited stereoscopic vision, hardware limitations, and design. Despite this, ARNS was preferred by most participants, including surgeons with prior navigation experience, and demonstrated a significantly better overall user experience. Depending on accuracy requirements, ARNS could serve as a viable alternative to CNS for guided drilling, with potential for future optimization.
Accurately predicting mood episodes in mood disorder patients using wearable sleep and circadian rhythm features
Wearable devices enable passive collection of sleep, heart rate, and step-count data, offering potential for mood episode prediction in patients with mood disorders. However, current models often require multiple data types, limiting real-world application. Here, we develop models that predict future episodes using only sleep-wake data, which are easily gathered through smartphones and wearables, trained on an individual's sleep-wake history and past mood episodes. Applying mathematical modeling to longitudinal data from 168 patients (587 days of average clinical follow-up, 267 days of wearable data), we derived 36 sleep and circadian rhythm features. These features enabled accurate next-day predictions for depressive, manic, and hypomanic episodes (AUCs: 0.80, 0.98, 0.95). Notably, daily circadian phase shifts were the most significant predictors: delays were linked to depressive episodes, advances to manic episodes. This prospective observational cohort study (ClinicalTrials.gov: NCT03088657, 2017-3-23) shows that sleep-wake data, combined with prior mood episode history, can effectively predict mood episodes, enhancing mood disorder management.
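One of the feature ideas, quantifying day-to-day circadian phase shifts (delays versus advances), can be approximated from sleep timing alone. The sketch below uses the nightly sleep midpoint as a crude phase proxy; this is a simplification for illustration, not the study's model-based phase estimation.

```python
# Minimal sketch: sleep midpoint as a circadian phase proxy and its
# day-to-day shift (positive = delay, negative = advance). Illustrative only.
import numpy as np

def sleep_midpoints(onsets_h, offsets_h):
    """Sleep onset/offset clock times in hours; offsets after midnight are
    expressed as values > 24 (e.g. 07:00 the next morning -> 31.0)."""
    return (np.asarray(onsets_h) + np.asarray(offsets_h)) / 2.0 % 24

def daily_phase_shift(midpoints_h):
    """Wrap day-to-day midpoint differences into (-12, 12] hours."""
    return (np.diff(midpoints_h) + 12) % 24 - 12

mid = sleep_midpoints([23.5, 24.5, 25.5], [31.0, 32.5, 33.5])
print(daily_phase_shift(mid))   # positive values indicate progressive delays
```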
Artificial intelligence assisted operative anatomy recognition in endoscopic pituitary surgery
Pituitary tumours are surrounded by critical neurovascular structures, and identifying these intra-operatively can be challenging. We have previously developed an AI model capable of sellar anatomy segmentation. This study aims to apply this model and explore the impact of AI assistance on clinicians' anatomy recognition. Participants were tasked with labelling the sella on six images, initially without assistance, then augmented by AI. Mean DICE scores and the proportion of annotations encompassing the centroid of the sella were calculated. Six medical students, six junior trainees, six intermediate trainees and six experts were recruited. There was an overall improvement in sella recognition from a DICE score of 70.7% without AI assistance to 77.5% with AI assistance (+6.7; p < 0.001). Medical students used and benefitted from AI assistance the most, improving from a DICE score of 66.2% to 78.9% (+12.8; p = 0.02). This technology has the potential to augment surgical education and eventually be used as an intra-operative decision support tool.
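The two reported metrics, the DICE overlap between a participant's annotation and the reference sella mask and whether the annotation encompasses the reference centroid, can be computed as in the sketch below (binary 2D masks assumed).

```python
# Minimal sketch of the two evaluation metrics on binary 2D masks.
import numpy as np

def dice_score(pred, truth):
    """2 * |intersection| / (|pred| + |truth|), as a fraction in [0, 1]."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum())

def contains_centroid(pred, truth):
    """True if the annotation covers the centroid of the reference mask."""
    rows, cols = np.nonzero(truth)
    r, c = int(rows.mean()), int(cols.mean())
    return bool(pred[r, c])
```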
Systematic review to understand users' perspectives on AI-enabled decision aids to inform shared decision making
Artificial intelligence (AI)-enabled decision aids can contribute to the shared decision-making process between patients and clinicians through personalised recommendations. This systematic review aims to understand users' perceptions of using AI-enabled decision aids to inform shared decision-making. Four databases were searched. The population, intervention, comparison, outcomes and study design tool was used to formulate eligibility criteria. Titles, abstracts and full texts were independently screened and PRISMA guidelines were followed. A narrative synthesis was conducted. Twenty-six articles were included, with AI-enabled decision aids used for screening and prevention, prognosis, and treatment. Patients found the AI-enabled decision aids easy to understand and user-friendly, fostering a sense of ownership and promoting better adherence to recommended treatment. Clinicians expressed concerns about how up-to-date the information was and the potential for over- or under-treatment. Despite users' positive perceptions, they also acknowledged certain challenges relating to usage and risk of bias that would need to be addressed. Registration: PROSPERO database (CRD42020220320).