BioData Mining

Predictive modeling of ALS progression: an XGBoost approach using clinical features
Gupta R, Bhandari M, Grover A, Al-Shehari T, Kadrie M, Alfakih T and Alsalman H
This research presents a predictive model aimed at estimating the progression of Amyotrophic Lateral Sclerosis (ALS) based on clinical features collected from a dataset of 50 patients. Important features included evaluations of speech, mobility, and respiratory function. We utilized an XGBoost regression model to forecast scores on the ALS Functional Rating Scale (ALSFRS-R), achieving a training mean squared error (MSE) of 0.1651 and a testing MSE of 0.0073, with R² values of 0.9800 for training and 0.9993 for testing. The model demonstrates high accuracy, providing a useful tool for clinicians to track disease progression and enhance patient management and treatment strategies.
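The two error measures quoted in this abstract, mean squared error (MSE) and the coefficient of determination (R²), can be computed as in the following minimal sketch. It is not the authors' code: the function names `mse` and `r_squared` are illustrative, and the inputs stand for any pair of observed and predicted ALSFRS-R scores.

```python
def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        # R^2 is undefined when the observed values do not vary.
        raise ValueError("y_true is constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot
```

Both functions would be applied separately to the training and testing splits to obtain the paired values reported above.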
Supervised multiple kernel learning approaches for multi-omics data integration
Briscik M, Tazza G, Vidács L, Dillies MA and Déjean S
Advances in high-throughput technologies have led to an ever-increasing availability of omics datasets. Integrating multiple heterogeneous data sources remains a challenge for biology and bioinformatics. Multiple kernel learning (MKL) has been shown to be a flexible and valid approach for handling the diverse nature of multi-omics inputs, yet it remains an underused tool in genomic data mining.
Modeling heterogeneity of Sudanese hospital stay in neonatal and maternal unit: non-parametric random effect models with Gamma distribution
Almohaimeed A and Adam I
Studies looking into patient and institutional variables linked to extended hospital stays have arisen as a result of the increased focus on severe maternal morbidity and mortality. Understanding the length of hospitalization of patients after delivery is important to gain insights into when hospitals will reach capacity and to predict corresponding staffing or equipment requirements. In Sudan, the distribution of the length of stay during delivery hospitalizations is heavily skewed, with an average length of stay of 2 to 3 days. This study aimed to investigate the use of a non-parametric random effect model with a Gamma-distributed response for analyzing skewed hospital length of stay data from a neonatal and maternal unit in Sudan.
Deep joint learning diagnosis of Alzheimer's disease based on multimodal feature fusion
Wang J, Wen S, Liu W, Meng X and Jiao Z
Alzheimer's disease (AD) is an advanced and incurable neurodegenerative disease. Genetic variations are intrinsic etiological factors contributing to the abnormal expression of brain function and structure in AD patients. A new multimodal feature fusion method called "magnetic resonance imaging (MRI)-p value" was proposed to construct 3D fusion images by introducing genes as a priori knowledge. Moreover, a new deep joint learning diagnostic model was constructed to fully learn image features. One branch trained a residual network (ResNet) to learn the features of local pathological regions. The other branch learned the position information of brain regions with different changes in the different categories of subjects' brains by introducing attention convolution, and then obtained the discriminative probability information from locations via convolution and global average pooling. The feature and position information of the two branches were linearly interacted to acquire the diagnostic basis for classifying the different categories of subjects. The diagnoses of AD and healthy control (HC), AD and mild cognitive impairment (MCI), and HC and MCI were performed with data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). The results showed that the proposed method achieved optimal results in AD-related diagnosis. The classification accuracy (ACC) and area under the curve (AUC) of the three experimental groups were 93.44% and 96.67%, 89.06% and 92%, and 84% and 81.84%, respectively. Moreover, a total of six novel genes were found to be significantly associated with AD, namely NTM, MAML2, NAALADL2, FHIT, TMEM132D and PCSK5, which provided new targets for the potential treatment of neurodegenerative diseases.
Investigating potential drug targets for IgA nephropathy and membranous nephropathy through multi-queue plasma protein analysis: a Mendelian randomization study based on SMR and co-localization analysis
Xu X, Miao C, Yang S, Xiao L, Gao Y, Wu F and Xu J
Membranous nephropathy (MN) and IgA nephropathy (IgAN) pose challenges in clinical treatment with existing therapies primarily focusing on symptom relief and often yielding unsatisfactory outcomes. The search for novel drug targets remains crucial to address the shortcomings in managing both kidney diseases.
Deciphering the tissue-specific functional effect of Alzheimer risk SNPs with deep genome annotation
Pugalenthi PV, He B, Xie L, Nho K, Saykin AJ and Yan J
Alzheimer's disease (AD) is a highly heritable form of dementia characterized by substantial failure of cognitive function. Large-scale genome-wide association studies (GWASs) have led to a set of SNPs significantly associated with AD and related traits. GWAS hits usually emerge as clusters where a lead SNP with the highest significance is surrounded by other less significant neighboring SNPs. Although functionality is not guaranteed even with the strongest associations in GWASs, lead SNPs have historically been the focus of the field, with the remaining associations inferred to be redundant. Recent deep genome annotation tools enable the prediction of function from a segment of a DNA sequence with significantly improved precision, which allows in-silico mutagenesis to interrogate the functional effect of SNP alleles. In this project, we explored the impact of top AD GWAS hits around the APOE region on chromatin functions and whether that impact is altered by the genetic context (i.e., alleles of neighboring SNPs). Our results showed that highly correlated SNPs in the same LD block could have distinct impacts on downstream functions. Although some GWAS lead SNPs showed dominant functional effects regardless of the neighborhood SNP alleles, several other SNPs did exhibit enhanced loss or gain of function under certain genetic contexts, suggesting potential additional information hidden in the LD blocks.
Transcriptome-based network analysis related to regulatory T cells infiltration identified RCN1 as a potential biomarker for prognosis in clear cell renal cell carcinoma
Qixin Y, Jing H, Jiang H, Xueyang L, Lu Y and Yuehua L
Regulatory T cells (Tregs) play a critical role in shaping the immunosuppressive microenvironment within tumors. Investigating the role of Tregs in clear cell renal cell carcinoma (ccRCC) is crucial for identifying prognostic markers and therapeutic targets for ccRCC.
Deep learning-based Emergency Department In-hospital Cardiac Arrest Score (Deep EDICAS) for early prediction of cardiac arrest and cardiopulmonary resuscitation in the emergency department
Deng YX, Wang JY, Ko CH, Huang CH, Tsai CL and Fu LC
Timely identification of deteriorating patients is crucial to prevent the progression to cardiac arrest. However, current methods for predicting emergency department cardiac arrest are primarily static and rule-based, offer limited precision, and cannot accommodate time-series data. Deep learning has the potential to incorporate continuously updated data and provide more precise predictions throughout the emergency department stay.
Decoding the genetic comorbidity network of Alzheimer's disease
Zhang X, Li D, Ye S, Liu S, Ma S, Li M, Peng Q, Hu L, Shang X, He M and Zhang L
Alzheimer's disease (AD) has emerged as the most prevalent and complex neurodegenerative disorder among the elderly population. However, the genetic comorbidity etiology for AD remains poorly understood. In this study, we conducted pleiotropic analysis for 41 AD phenotypic comorbidities, identifying ten genetic comorbidities with 16 pleiotropy genes associated with AD. Through biological functional and network analysis, we elucidated the molecular and functional landscape of AD genetic comorbidities. Furthermore, leveraging the pleiotropic genes and reported biomarkers for AD genetic comorbidities, we identified 50 potential biomarkers for AD diagnosis. Our findings deepen the understanding of the occurrence of AD genetic comorbidities and provide new insights for the search for AD diagnostic markers.
PAGER: A novel genotype encoding strategy for modeling deviations from additivity in complex trait association studies
Freda PJ, Ghosh A, Bhandary P, Matsumoto N, Chitre AS, Zhou J, Hall MA, Palmer AA, Obafemi-Ajayi T and Moore JH
The additive model of inheritance assumes that heterozygotes (Aa) are exactly intermediate with respect to homozygotes (AA and aa). While this model is commonly used in single-locus genetic association studies, significant deviations from additivity are well-documented and contribute to phenotypic variance across many traits and systems. Assuming additivity can therefore introduce type I and type II errors by overestimating or underestimating the effects of variants that deviate from it. Alternative genotype encoding strategies have been explored to account for different inheritance patterns, but they often incur significant computational or methodological costs. To address these challenges, we introduce PAGER (Phenotype Adjusted Genotype Encoding and Ranking), an efficient pre-processing method that encodes each genetic variant based on normalized mean phenotypic differences between diallelic genotype classes (AA, Aa, and aa). This approach more accurately reflects each variant's true inheritance model, improving model precision while minimizing the costs associated with alternative encoding strategies.
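The idea of encoding a variant from normalized mean phenotypic differences between genotype classes can be illustrated with a small sketch. This is one plausible reading of the abstract, not the published PAGER implementation: the integer genotype codes (0 = AA, 1 = Aa, 2 = aa), the choice of AA as the reference class, the min-max scaling, and the function name `phenotype_adjusted_encoding` are all assumptions made for illustration.

```python
def phenotype_adjusted_encoding(genotypes, phenotypes, reference=0):
    """Map each genotype class to a value in [0, 1] based on how far its
    mean phenotype lies from the mean of the reference class.

    Assumes the reference class occurs at least once in `genotypes`.
    """
    # Group phenotype values by genotype class.
    groups = {}
    for g, y in zip(genotypes, phenotypes):
        groups.setdefault(g, []).append(y)

    # Mean phenotype per class, expressed relative to the reference class.
    means = {g: sum(v) / len(v) for g, v in groups.items()}
    diffs = {g: m - means[reference] for g, m in means.items()}

    # Min-max normalization (an assumed choice of normalization).
    lo, hi = min(diffs.values()), max(diffs.values())
    if hi == lo:
        return {g: 0.0 for g in diffs}
    return {g: (d - lo) / (hi - lo) for g, d in diffs.items()}


# Hypothetical data in which heterozygotes resemble the aa homozygotes.
codes = phenotype_adjusted_encoding(
    [0, 0, 1, 1, 2, 2],
    [1.0, 1.2, 2.9, 3.1, 3.0, 3.2],
)
```

In this toy data set the heterozygote class receives a code close to that of aa instead of the midpoint value that an additive encoding would assign, which is the kind of deviation from additivity the abstract describes.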
From COVID-19 to monkeypox: a novel predictive model for emerging infectious diseases
Xu D, Chan WH, Haron H, Nies HW and Moorthy K
The outbreak of emerging infectious diseases poses significant challenges to global public health. Accurate early forecasting is crucial for effective resource allocation and emergency response planning. This study aims to develop a comprehensive predictive model for emerging infectious diseases, integrating the blending framework, transfer learning, incremental learning, and the biological feature Rt to increase prediction accuracy and practicality. By transferring features from a COVID-19 dataset to a monkeypox dataset and introducing dynamically updated incremental learning techniques, the model's predictive capability in data-scarce scenarios was significantly improved. The research findings demonstrate that the blending framework performs exceptionally well in short-term (7-day) predictions. Furthermore, the combination of transfer learning and incremental learning techniques significantly enhanced the adaptability and precision, with a 91.41% improvement in the RMSE and an 89.13% improvement in the MAE. In particular, the inclusion of the Rt feature enabled the model to more accurately reflect the dynamics of disease spread, further improving the RMSE by 1.91% and the MAE by 2.17%. This study underscores the significant application potential of multimodel fusion and real-time data updates in infectious disease prediction, offering new theoretical perspectives and technical support. This research not only enriches the theoretical foundation of infectious disease prediction models but also provides reliable technical support for public health emergency responses. Future research should continue to explore integrating data from multiple sources and enhancing model generalization capabilities to further enhance the practicality and reliability of predictive tools.
G4 & the balanced metric family - a novel approach to solving binary classification problems in medical device validation & verification studies
Marra A
In medical device validation and verification studies, the area under the receiver operating characteristic curve (AUROC) is often used as a primary endpoint despite multiple reports showing its limitations. Hence, researchers are encouraged to consider alternative metrics as primary endpoints. A new metric called G4 is presented, which is the geometric mean of sensitivity, specificity, the positive predictive value, and the negative predictive value. G4 is part of a balanced metric family which includes the Unified Performance Measure (also known as P4) and the Matthews' Correlation Coefficient (MCC). The purpose of this manuscript is to unveil the benefits of using G4 together with the balanced metric family when analyzing the overall performance of binary classifiers.
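A minimal sketch of the metrics named in this abstract, computed from the four cells of a binary confusion matrix. G4 follows the definition given above (geometric mean of sensitivity, specificity, positive predictive value, and negative predictive value). The P4 formula (harmonic mean of the same four quantities) and the MCC formula are the commonly cited definitions and are not taken from the abstract itself; the function name `balanced_metrics` and the convention of returning 0.0 for undefined ratios are assumptions.

```python
import math


def balanced_metrics(tp, fp, tn, fn):
    """Return G4, P4 and MCC for one binary confusion matrix."""

    def ratio(num, den):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    sens = ratio(tp, tp + fn)   # sensitivity (recall)
    spec = ratio(tn, tn + fp)   # specificity
    ppv = ratio(tp, tp + fp)    # positive predictive value (precision)
    npv = ratio(tn, tn + fn)    # negative predictive value
    parts = [sens, spec, ppv, npv]

    # G4: geometric mean of the four quantities.
    g4 = math.prod(parts) ** 0.25
    # P4: harmonic mean of the four quantities (0 if any of them is 0).
    p4 = 4.0 / sum(1.0 / x for x in parts) if all(parts) else 0.0
    # MCC: correlation between predicted and true labels.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, mcc_den)
    return {"G4": g4, "P4": p4, "MCC": mcc}
```

Because G4 and P4 are means of all four quantities, a classifier that scores zero on any one of them (for example, one that predicts every case as positive) gets G4 = P4 = 0.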
Priority-Elastic net for binary disease outcome prediction based on multi-omics data
Musib L, Coletti R, Lopes MB, Mouriño H and Carrasquinha E
High-dimensional omics data integration has emerged as a prominent avenue within the healthcare industry, presenting substantial potential to improve predictive models. However, the data integration process faces several challenges, including data heterogeneity, the priority sequence in which data blocks are used to capture the predictive information contained in multiple blocks, the assessment of the flow of information from one omics level to another, and multicollinearity.
Assessing the limitations of relief-based algorithms in detecting higher-order interactions
Freda PJ, Ye S, Zhang R, Moore JH and Urbanowicz RJ
Epistasis, the interaction between genetic loci where the effect of one locus is influenced by one or more other loci, plays a crucial role in the genetic architecture of complex traits. However, as the number of loci considered increases, the investigation of epistasis becomes exponentially more complex, making the selection of key features vital for effective downstream analyses. Relief-Based Algorithms (RBAs) are often employed for this purpose due to their reputation as "interaction-sensitive" algorithms and uniquely non-exhaustive approach. However, the limitations of RBAs in detecting interactions, particularly those involving multiple loci, have not been thoroughly defined. This study seeks to address this gap by evaluating the efficiency of RBAs in detecting higher-order epistatic interactions. Motivated by previous findings that suggest some RBAs may rank predictive features involved in higher-order epistasis negatively, we explore the potential of absolute value ranking of RBA feature weights as an alternative approach for capturing complex interactions. In this study, we assess the performance of ReliefF, MultiSURF, and MultiSURFstar on simulated genetic datasets that model various patterns of genotype-phenotype associations, including 2-way to 5-way genetic interactions, and compare their performance to two control methods: a random shuffle and mutual information.
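The difference between ranking features by their signed Relief-based weights and by the absolute value of those weights can be sketched as follows. The feature names and weights are invented for illustration and do not come from the study; `rank_features` is a hypothetical helper, not part of any RBA library.

```python
def rank_features(weights, use_absolute=False):
    """Return feature names ordered from highest to lowest score.

    With use_absolute=True, strongly negative weights rank near the top
    instead of at the bottom.
    """
    def score(item):
        return abs(item[1]) if use_absolute else item[1]

    ordered = sorted(weights.items(), key=score, reverse=True)
    return [name for name, _ in ordered]


# Hypothetical RBA output: snp2 carries a large negative weight.
weights = {"snp1": 0.12, "snp2": -0.30, "snp3": 0.02, "snp4": -0.01}
signed_order = rank_features(weights)
absolute_order = rank_features(weights, use_absolute=True)
```

With signed ranking `snp2` comes last, whereas with absolute-value ranking it comes first, which is how a predictive feature that an RBA scores negatively could still be selected.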
MDVarP: modifier ~ disease-causing variant pairs predictor
Sun H, Chen Y and Ma L
Modifiers significantly impact disease phenotypes by modulating the effects of disease-causing variants, resulting in varying disease manifestations among individuals. However, identifying genetic interactions between modifier and disease-causing variants is challenging.
Deep learning-based approaches for multi-omics data integration and analysis
Ballard JL, Wang Z, Li W, Shen L and Long Q
The rapid growth of deep learning, together with the vast and ever-growing amount of available data, has provided ample opportunity for advances in the fusion and analysis of complex and heterogeneous data types. Different data modalities provide complementary information that can be leveraged to gain a more complete understanding of each subject. In the biomedical domain, multi-omics data includes molecular (genomics, transcriptomics, proteomics, epigenomics, metabolomics, etc.) and imaging (radiomics, pathomics) modalities which, when combined, have the potential to improve performance on prediction, classification, clustering and other tasks. Deep learning encompasses a wide variety of methods, each of which has certain strengths and weaknesses for multi-omics integration.
Ensemble feature selection and tabular data augmentation with generative adversarial networks to enhance cutaneous melanoma identification and interpretability
Gómez-Martínez V, Chushig-Muzo D, Veierød MB, Granja C and Soguero-Ruiz C
Cutaneous melanoma is the most aggressive form of skin cancer, responsible for most skin cancer-related deaths. Recent advances in artificial intelligence, together with the availability of public dermoscopy image datasets, have made it possible to assist dermatologists in melanoma identification. While image feature extraction holds potential for melanoma detection, it often leads to high-dimensional data. Furthermore, most image datasets suffer from class imbalance, where a few classes have numerous samples whereas others are under-represented.
A regularized Cox hierarchical model for incorporating annotation information in predictive omic studies
Shen D, Lewinger JP and Kawaguchi E
High-dimensional omics data are often accompanied by "meta-features", such as biological pathways, functional annotations, and summary statistics from similar studies, that can be informative for predicting an outcome of interest. We introduce a regularized hierarchical framework for integrating meta-features, with the goal of improving prediction and feature selection performance with time-to-event outcomes.
Development, evaluation and comparison of machine learning algorithms for predicting in-hospital patient charges for congestive heart failure exacerbations, chronic obstructive pulmonary disease exacerbations and diabetic ketoacidosis
Arnold M, Liou L and Boland MR
Hospitalizations for exacerbations of congestive heart failure (CHF), chronic obstructive pulmonary disease (COPD) and diabetic ketoacidosis (DKA) are costly in the United States. The purpose of this study was to predict in-hospital charges for each condition using machine learning (ML) models.
Knowledge-slanted random forest method for high-dimensional data and small sample size with a feature selection application for gene expression data
Cantor E, Guauque-Olarte S, León R, Chabert S and Salas R
The use of prior knowledge in the machine learning framework has been considered a potential tool to handle the curse of dimensionality in genetic and genomics data. Although random forest (RF) represents a flexible non-parametric approach with several advantages, it can provide poor accuracy in high-dimensional settings, mainly in scenarios with small sample sizes. We propose a knowledge-slanted RF that integrates biological networks as prior knowledge into the model to improve its performance and explainability, exemplifying its use for selecting and identifying relevant genes. Knowledge-slanted RF is a combination of two stages. First, prior knowledge represented by graphs is translated by running a random walk with restart algorithm to determine the relevance of each gene based on its connection and localization on a protein-protein interaction network. Then, each relevance is used to modify the selection probability to draw a gene as a candidate split-feature in the conventional RF. Experiments on simulated datasets with very small sample sizes, comparing knowledge-slanted RF against conventional RF and logistic lasso regression, suggest improved precision in outcome prediction relative to the other methods. The knowledge-slanted RF was completed with the introduction of a modified version of the Boruta feature selection algorithm. Finally, knowledge-slanted RF identified more biologically relevant genes, offering a higher level of explainability for users than conventional RF. These findings were corroborated in one real case to identify genes relevant to calcific aortic valve stenosis.
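The two stages described above can be sketched in a few lines: a random walk with restart (RWR) over a network to score each gene, followed by relevance-weighted sampling of candidate split features. This is a simplified illustration under stated assumptions, not the authors' implementation: the toy network, the restart probability of 0.3, the handling of isolated nodes, and the function names are all hypothetical.

```python
import random


def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a walk that, at each step,
    returns to the seed nodes with probability `restart`."""
    nodes = sorted(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            moving = (1.0 - restart) * p[u]
            if adj[u]:
                # Spread the walking mass evenly over the neighbors of u.
                for v in adj[u]:
                    nxt[v] += moving / len(adj[u])
            else:
                # Assumed convention: isolated nodes send mass to the seeds.
                for n in nodes:
                    nxt[n] += moving * p0[n]
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p


def sample_candidate_genes(relevance, k, rng):
    """Draw k distinct genes, each with probability proportional to its
    relevance (a tiny floor keeps zero-relevance genes drawable)."""
    pool = dict(relevance)
    chosen = []
    for _ in range(min(k, len(pool))):
        names = list(pool)
        weights = [pool[n] + 1e-12 for n in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]
    return chosen


# Toy undirected network with hypothetical gene names; "A" is the seed.
network = {"A": ["B", "C"], "B": ["A"], "C": ["A", "D"], "D": ["C"]}
relevance = random_walk_with_restart(network, seeds={"A"})
candidates = sample_candidate_genes(relevance, k=2, rng=random.Random(0))
```

In a conventional RF the candidate split features at each node are drawn uniformly; here genes that sit close to the seed in the network are drawn more often, which is the "slant" the abstract refers to.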
Identifying heterogeneous subgroups of systemic autoimmune diseases by applying a joint dimension reduction and clustering approach to immunomarkers
Chang CW, Wang HY, Lin WY, Wang YC, Lo WL, Lin TW, Yu JR and Tseng YJ
The high complexity of systemic autoimmune diseases (SADs) has hindered precise management. This study aims to investigate heterogeneity in SADs.