BRIEFINGS IN BIOINFORMATICS - 雨日青学习小站

VGAE-CCI: variational graph autoencoder-based construction of 3D spatial cell-cell communication network

Zhang T, Zhang X, Wu Z, Ren J, Zhao Z, Zhang H, Wang G and Wang T

Cell-cell communication plays a critical role in maintaining normal biological functions, regulating development and differentiation, and controlling immune responses. The rapid development of single-cell RNA sequencing and spatial transcriptomics sequencing (ST-seq) technologies provides essential data support for in-depth and comprehensive analysis of cell-cell communication. However, ST-seq data are often incomplete and contain systematic biases, which may reduce the accuracy and reliability of predicting cell-cell communication. Furthermore, existing methods for analyzing cell-cell communication mainly focus on individual tissue sections, neglecting cell-cell communication across multiple tissue layers, and fail to comprehensively elucidate cell-cell communication networks within three-dimensional tissues. To address these issues, we propose VGAE-CCI, a deep learning framework based on the Variational Graph Autoencoder, capable of identifying cell-cell communication across multiple tissue layers. Additionally, this model can be applied to spatial transcriptomics data with missing or partially incomplete data and can cluster cells at single-cell resolution based on spatial encoding information within complex tissues, thereby enabling more accurate inference of cell-cell communication. Finally, we tested our method on six datasets and compared it with other state-of-the-art methods for predicting cell-cell communication. Our method outperformed other methods across multiple metrics, demonstrating its efficiency and reliability in predicting cell-cell communication.

View more:

Pubmed

Brief Bioinform

PMC Article

RNADiffFold: generative RNA secondary structure prediction using discrete diffusion models

Wang Z, Feng Y, Tian Q, Liu Z, Yan P and Li X

Ribonucleic acid (RNA) molecules are essential macromolecules that perform diverse biological functions in living beings. Precise prediction of RNA secondary structures is instrumental in deciphering their complex three-dimensional architecture and functionality. Traditional methodologies for RNA structure prediction, including energy-based and learning-based approaches, often depict RNA secondary structures from a static perspective and rely on stringent a priori constraints. Inspired by the success of diffusion models, in this work, we introduce RNADiffFold, a generative approach to predicting RNA secondary structures based on multinomial diffusion. We reconceptualize the prediction of contact maps as akin to pixel-wise segmentation and accordingly train a denoising model to progressively refine the contact maps starting from a noise-infused state. We also devise a conditioning mechanism that harnesses features extracted from RNA sequences to steer the model toward generating an accurate secondary structure. These features encompass one-hot encoded sequences, probabilistic maps generated from a pre-trained scoring network, and embeddings and attention maps derived from an RNA foundation model. Experimental results on both within- and cross-family datasets demonstrate RNADiffFold's competitive performance compared with current state-of-the-art methods. Additionally, RNADiffFold has shown a notable proficiency in capturing the dynamic aspects of RNA structures, a claim corroborated by its performance on datasets comprising multiple conformations.

PartIES: a disease subtyping framework with Partition-level Integration using diffusion-Enhanced Similarities from multi-omics Data

Miao Y, Xu H and Wang S

Integrating multi-omics data helps identify disease subtypes. Many similarity-based methods have been developed for disease subtyping using multi-omics data, and many of them focus on extracting clustering structures common to multiple types of omics data without preserving data-type-specific clustering structures. Moreover, the clustering performance of similarity-based methods suffers when similarity measures are noisy. Here we propose PartIES, a Partition-level Integration using diffusion-Enhanced Similarities, to perform disease subtyping using multi-omics data. PartIES first uses diffusion to reduce noise in the similarity/kernel matrices from individual omics data types, then extracts partition information from the diffusion-enhanced similarity matrices and integrates the partition-level similarity iteratively through a weighted average. Simulation studies showed that (1) the diffusion step enhances clustering accuracy, and (2) PartIES outperforms competing methods, particularly when omics data types provide different clustering structures. Using mRNA, long noncoding RNA, and microRNA expression data, DNA methylation data, and somatic mutation data from The Cancer Genome Atlas project, PartIES identified subtypes in bladder urothelial carcinoma, liver hepatocellular carcinoma, and thyroid carcinoma that are most significantly associated with patient survival across all methods. Further investigations suggested that among subtype-associated genes, many of those that interact heavily with other genes are known important cancer genes. The identified cancer subtypes also have different activity levels for some known cancer-related pathways. The R code can be accessed at https://github.com/yuqimiao/PartIES.git.

Comprehensive human respiratory genome catalogue underlies the high resolution and precision of the respiratory microbiome

Li Y, Pan G, Wang S, Li Z, Yang R, Jiang Y, Chen Y, Li SC and Shen B

The human respiratory microbiome plays a crucial role in respiratory health, but there is no comprehensive respiratory genome catalogue (RGC) for studying the microbiome. In this study, we collected whole-metagenome shotgun sequencing data from 4067 samples and sequenced long reads of 124 samples, yielding 9.08 and 0.42 Tbp of short- and long-read data, respectively. By submitting these data to a novel assembly algorithm, we obtained a comprehensive human RGC. This high-quality RGC contains 190,443 contigs over 1 kbp with an N50 length exceeding 13 kbp; it comprises 159 high-quality and 393 medium-quality genomes, including 117 previously uncharacterized respiratory bacteria. Moreover, the RGC contains 209 respiratory-specific species not captured by the unified human gastrointestinal genome. Using the RGC, we revisited a study on a pediatric pneumonia dataset and identified 17 pneumonia-specific respiratory pathogens, reversing an inaccurate etiological conclusion due to the previous incomplete reference. Furthermore, we applied the RGC to the data of 62 participants with a clinical diagnosis of infection. Compared to the Nucleotide database, the RGC yielded greater specificity (0 for the Nucleotide database versus 0.444 for the RGC) and sensitivity (0.852 versus 0.881, in the same order), suggesting that the RGC provides superior sensitivity and specificity for the clinical diagnosis of respiratory diseases.
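
The N50 figure quoted above is a standard assembly statistic: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. A minimal sketch of that definition follows; the function name `n50` and the toy lengths are illustrative assumptions and are not taken from the RGC pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy contig lengths (hypothetical, not RGC data).
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))
```

Comparing `2 * running` with `total` keeps the check in integer arithmetic, which avoids rounding issues for assemblies with an odd total length.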

tcrBLOSUM: an amino acid substitution matrix for sensitive alignment of distant epitope-specific TCRs

Postovskaya A, Vercauteren K, Meysman P and Laukens K

Deciphering the specificity of T-cell receptor (TCR) repertoires is crucial for monitoring adaptive immune responses and developing targeted immunotherapies and vaccines. To elucidate the specificity of previously unseen TCRs, many methods employ the BLOSUM62 matrix to find TCRs with similar amino acid (AA) sequences. However, while BLOSUM62 reflects the AA substitutions within conserved regions of proteins with similar functions, the remarkable diversity of TCRs means that both TCRs with similar and dissimilar sequences can bind the same epitope. Therefore, reliance on BLOSUM62 may bias detection towards epitope-specific TCRs with similar biochemical properties, overlooking those with more diverse AA compositions. In this study, we introduce tcrBLOSUMa and tcrBLOSUMb, specialized AA substitution matrices for CDR3 alpha and CDR3 beta TCR chains, respectively. The matrices reflect AA frequencies and variations occurring within TCRs that bind the same epitope, revealing that both CDR3 alpha and CDR3 beta display tolerance to a wide range of AA substitutions and differ noticeably from the standard BLOSUM62. By accurately aligning distant TCRs employing tcrBLOSUMb, we were able to improve clustering performance and capture a large number of epitope-specific TCRs with diverse AA compositions and physicochemical profiles overlooked by BLOSUM62. Utilizing both the general BLOSUM62 and specialized tcrBLOSUM matrices in existing computational tools will broaden the range of TCRs that can be associated with their cognate epitopes, thereby enhancing TCR repertoire analysis.
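
For orientation, the standard BLOSUM scoring form is a log-odds ratio of observed to expected residue-pair frequencies. The block below states that general form; whether tcrBLOSUMa and tcrBLOSUMb use exactly this scaling and rounding is an assumption here, since the abstract does not specify it. The symbols $q_{ij}$ (observed frequency of the aligned pair $i,j$), $p_i$ (background frequency of residue $i$) and $e_{ij}$ (expected pair frequency) are the usual BLOSUM notation, not notation from the abstract.

```latex
\[
p_i = q_{ii} + \tfrac{1}{2}\sum_{j \neq i} q_{ij}, \qquad
e_{ij} =
\begin{cases}
p_i^{2}, & i = j,\\
2\,p_i\,p_j, & i \neq j,
\end{cases}
\qquad
s_{ij} = \operatorname{round}\!\left( 2 \log_2 \frac{q_{ij}}{e_{ij}} \right)
\]
```

A positive $s_{ij}$ means the pair is observed more often than chance would predict. In a matrix built from TCRs that bind the same epitope, substitutions tolerated within such groups would therefore score higher than they do in BLOSUM62.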

CosGeneGate selects multi-functional and credible biomarkers for single-cell analysis

Liu T, Long W, Cao Z, Wang Y, He CH, Zhang L, Strittmatter SM and Zhao H

Selecting representative genes or marker genes to distinguish cell types is an important task in single-cell sequencing analysis. Although many methods have been proposed to select marker genes, the genes selected may have redundancy and/or do not show cell-type-specific expression patterns to distinguish cell types.

Deciphering the genetic interplay between depression and dysmenorrhea: a Mendelian randomization study

Liu S, Wei Z, Carr DF and Moraros J

This study aims to explore the link between depression and dysmenorrhea by using an integrated and innovative approach that combines genomic, transcriptomic, and protein interaction data/information from various resources.

ONDSA: a testing framework based on Gaussian graphical models for differential and similarity analysis of multiple omics networks

Chen J, Murabito JM and Lunetta KL

The Gaussian graphical model (GGM) is a statistical network approach that represents conditional dependencies among components, enabling a comprehensive exploration of disease mechanisms using high-throughput multi-omics data. Analyzing differential and similar structures in biological networks across multiple clinical conditions can reveal significant biological pathways and interactions associated with disease onset and progression. However, most existing methods for estimating group differences in sparse GGMs only apply to comparisons between two groups, and the challenging problem of multiple testing across multiple GGMs persists. This limitation hinders the ability to uncover complex biological insights that arise from comparing multiple conditions simultaneously. To address these challenges, we propose the Omics Networks Differential and Similarity Analysis (ONDSA) framework, specifically designed for continuous omics data. ONDSA tests for structural differences and similarities across multiple groups, effectively controlling the false discovery rate (FDR) at a desired level. Our approach focuses on entry-wise comparisons of precision matrices across groups, introducing two test statistics to sequentially estimate structural differences and similarities while adjusting for correlated effects in FDR control procedures. We show via comprehensive simulations that ONDSA outperforms existing methods under a range of graph structures and is a valuable tool for joint comparisons of multiple GGMs. We also illustrate our method through the detection of neuroinflammatory pathways in a multi-omics dataset from the Framingham Heart Study Offspring cohort, involving three apolipoprotein E genotype groups. It highlights ONDSA's ability to provide a more holistic view of biological interactions and disease mechanisms through multi-omics data integration.

LIMO-GCN: a linear model-integrated graph convolutional network for predicting Alzheimer disease genes

Lin CX, Li HD and Wang J

Alzheimer's disease (AD) is a complex disease whose genetic etiology is not fully understood. Gene network-based methods have proven promising in predicting AD genes. However, existing approaches are limited in their ability to model the nonlinear relationship between networks and disease genes, because (i) any data can be theoretically decomposed into the sum of a linear part and a nonlinear part, (ii) the linear part can be best modeled by a linear model since a nonlinear model is biased and can easily overfit, and (iii) existing methods do not separate the linear part from the nonlinear part when building the disease gene prediction model. To address this limitation, we propose the linear model-integrated graph convolutional network (LIMO-GCN), a generic disease gene prediction method that models the linearity and nonlinearity in the data by integrating a linear model with a GCN. We use a GCN because it is by design naturally suited to network data, and we integrate a linear model because the linearity in the data can be best modeled by a linear model. The weighted sum of the predictions of the two components is used as the final prediction of LIMO-GCN. We then apply LIMO-GCN to the prediction of AD genes. LIMO-GCN outperforms state-of-the-art approaches including GCN, network-wide association studies, and random walk. Furthermore, we show that the top-ranked genes are significantly associated with AD based on molecular evidence from heterogeneous genomic data. Our results indicate that LIMO-GCN provides a novel method for prioritizing AD genes.

DiMA: sequence diversity dynamics analyser for viruses

Tharanga S, Ünlü ES, Hu Y, Sjaugi MF, Çelik MA, Hekimoğlu H, Miotto O, Öncel MM and Khan AM

Sequence diversity is one of the major challenges in the design of diagnostic, prophylactic, and therapeutic interventions against viruses. DiMA is a novel tool that is big data-ready and designed to facilitate the dissection of sequence diversity dynamics for viruses. DiMA stands out from other diversity analysis tools by offering various unique features. DiMA provides a quantitative overview of sequence (DNA/RNA/protein) diversity by use of Shannon's entropy corrected for size bias, applied via a user-defined k-mer sliding window to an input alignment file, and each k-mer position is dissected into various diversity motifs. The motifs are defined based on the probability of distinct sequences at a given k-mer alignment position, whereby the index is the predominant sequence, while all the others are (total) variants of the index. The total variants are sub-classified into the major (most common) variant, minor variants (occurring more than once and of incidence lower than the major), and the unique (singleton) variants. DiMA allows user-defined, sequence metadata enrichment for analyses of the motifs. The application of DiMA was demonstrated for the alignment data of the relatively conserved Spike protein (2,106,985 sequences) of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and the relatively highly diverse pol gene (2637 sequences) of the human immunodeficiency virus-1 (HIV-1). The tool is publicly available as a web server (https://dima.bezmialem.edu.tr), as a Python library (via PyPI) and as a command line client (via GitHub).
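
The motif scheme described above can be sketched in a few lines of Python. This is an illustration of the idea only and is not the DiMA library's API. The names `kmer_motifs` and `scan` are hypothetical, the entropy is plain Shannon entropy without DiMA's size-bias correction, and the tie-handling rules (a variant seen once is always "unique", and variants tied with the major are listed as minor) are assumptions.

```python
from collections import Counter
from math import log2


def kmer_motifs(aligned_seqs, k, pos):
    """Entropy and motif classes for the k-mers starting at column `pos`."""
    counts = Counter(seq[pos:pos + k] for seq in aligned_seqs)
    total = sum(counts.values())
    # Plain Shannon entropy in bits (no size-bias correction in this sketch).
    entropy = sum(-(c / total) * log2(c / total) for c in counts.values())

    ranked = counts.most_common()   # most frequent k-mer first
    index = ranked[0][0]            # predominant sequence
    variants = ranked[1:]           # all other distinct k-mers
    repeated = [s for s, c in variants if c > 1]
    return {
        "entropy": entropy,
        "index": index,
        "major": repeated[0] if repeated else None,  # most common variant
        "minor": repeated[1:],                       # other repeated variants
        "unique": [s for s, c in variants if c == 1],  # singletons
    }


def scan(aligned_seqs, k):
    """Slide a k-mer window over every column of an equal-length alignment."""
    width = len(aligned_seqs[0])
    return [kmer_motifs(aligned_seqs, k, p) for p in range(width - k + 1)]


# Toy protein alignment (hypothetical sequences).
toy = ["ACDEF"] * 4 + ["ACDQF"] * 3 + ["ACNEF"] * 2 + ["ACDEY"]
for position, motifs in enumerate(scan(toy, k=3)):
    print(position, motifs)
```

In the toy alignment, the window starting at column 2 contains four distinct 3-mers, so it yields an index, a major variant, one minor variant and one unique variant.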

miCGR: interpretable deep neural network for predicting both site-level and gene-level functional targets of microRNA

Wu X, Zhang L, Tong X, Wang Y, Zhang Z, Kong X, Ni S, Luo X, Zheng M, Tang Y and Li X

MicroRNAs (miRNAs) are critical regulators in various biological processes that cleave messenger RNAs (mRNAs) or repress their translation. Accurately predicting miRNA targets is essential for developing miRNA-based therapies for diseases such as cancer and cardiovascular disease. Traditional miRNA target prediction methods often struggle due to incomplete knowledge of miRNA-target interactions and lack interpretability. To address these limitations, we propose miCGR, an end-to-end deep learning framework for predicting functional miRNA targets. miCGR employs 2D convolutional neural networks alongside an enhanced Chaos Game Representation (CGR) of both miRNA sequences and their candidate target sites (CTSs) on mRNA. This advanced CGR transforms genetic sequences into informative 2D graphical representations based on sequence composition and subsequence frequencies, and explicitly incorporates important prior knowledge of seed regions and subsequence positions. Unlike one-dimensional methods based solely on sequence characters, this approach identifies functional motifs within sequences, even if they are distant in the original sequences. Our model outperforms existing methods in predicting functional targets at both the site and gene levels. To enhance interpretability, we incorporate Shapley value analysis for each subsequence within both miRNA sequences and their target sites, allowing miCGR to achieve improved accuracy, particularly with more lenient CTS selection criteria. Finally, two case studies demonstrate the practical applicability of miCGR, highlighting its potential to provide insights for optimizing artificial miRNA analogs that surpass endogenous counterparts.
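
As background, the classic Chaos Game Representation maps a nucleotide sequence to points in the unit square by repeatedly moving halfway toward the corner assigned to each base. The sketch below implements that classic form and a frequency grid derived from it. It does not reproduce miCGR's enhanced CGR, which adds seed-region and position information; the corner assignment and the function names are assumptions made for illustration.

```python
# One common corner layout; other layouts are equally valid.
CORNERS = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "U": (1.0, 0.0),
    "T": (1.0, 0.0),  # treat DNA and RNA alphabets alike
}


def cgr_points(seq):
    """Return the CGR trajectory of `seq` as a list of (x, y) points."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in seq.upper():
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2  # move halfway to the corner
        points.append((x, y))
    return points


def cgr_grid(seq, order):
    """Count CGR points in a 2**order x 2**order grid.

    Each cell corresponds to one subsequence of length `order`, so the
    grid is a 2D table of subsequence frequencies.
    """
    size = 2 ** order
    grid = [[0] * size for _ in range(size)]
    # Skip the first order-1 points: they do not yet encode a full subsequence.
    for x, y in cgr_points(seq)[order - 1:]:
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid


# Toy RNA fragment (hypothetical input).
for row in cgr_grid("UGAGGUAG", order=2):
    print(row)
```

A grid like this is the kind of 2D input a convolutional network can consume, and it hints at why motifs that are far apart in the sequence can land in nearby cells.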

The improved de Bruijn graph for multitask learning: predicting functions, subcellular localization, and interactions of noncoding RNAs

Wei Y, Zhang Q and Liu L

Noncoding RNA refers to RNA that does not encode proteins. Noncoding RNAs such as lncRNAs and miRNAs play crucial regulatory roles in organisms, and their aberrant expression is closely related to various diseases. Traditional experimental methods for validating the interactions of these RNAs have limitations, and existing prediction models offer relatively limited functionality, rely on isolated feature extraction, and perform poorly on various types of small-sample tasks. This paper proposes an improved de Bruijn graph that can inject RNA structural information into the graph while preserving sequence information. Furthermore, the improved de Bruijn graph enables graph neural networks to learn broader dependencies and correlations among data by introducing richer edge relationships. Meanwhile, the multitask learning model proposed in this paper, DVMnet, can handle multiple related tasks, and we optimize model parameters by integrating the total loss of three tasks. This enables multitask prediction of RNA interactions, disease associations, and subcellular localization. Compared with the best existing models in this field, DVMnet achieves the best performance, with a 3% improvement in the area under the curve, and demonstrates robust results in predicting diseases and subcellular localization. The improved de Bruijn graph is also applicable to various scenarios and can unify the sequence and structural information of various nucleic acids into a single graph.
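
For readers unfamiliar with the underlying data structure, a plain de Bruijn graph of a sequence can be built as below: nodes are k-mers, and a directed edge links each k-mer to the one that follows it with an overlap of k-1 characters. This sketch shows only the classic construction. The structural edges of the improved graph described above are not modeled, and the function name `de_bruijn` is a hypothetical choice.

```python
from collections import defaultdict


def de_bruijn(seq, k):
    """Return {kmer: {next_kmer: count}} for consecutive k-mers in `seq`.

    A k-mer that only ever appears at the very end of the sequence has no
    outgoing edge and therefore does not appear as a key.
    """
    graph = defaultdict(lambda: defaultdict(int))
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    for current, following in zip(kmers, kmers[1:]):
        graph[current][following] += 1  # edge weight = times the step occurs
    return {node: dict(successors) for node, successors in graph.items()}


# Toy RNA sequence (hypothetical input).
for node, successors in de_bruijn("AUGAUG", k=3).items():
    print(node, "->", successors)
```

Repeated k-mers collapse into a single node, so the graph stays compact for repetitive sequences while the edge weights retain how often each transition occurs.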

Toward molecular diagnosis of major depressive disorder by plasma peptides using a deep learning approach

Wang J, Xi R, Wang Y, Gao H, Gao M, Zhang X, Zhang L and Zhang Y

Major depressive disorder (MDD) is a severe psychiatric disorder that currently lacks any objective diagnostic markers. Here, we develop a deep learning approach to discover the mass spectrometric features that can discriminate MDD patients from healthy controls. Using plasma peptides, the neural network, termed CMS-Net, can perform diagnosis and prediction with an accuracy of 0.9441. The sensitivity and specificity reached 0.9352 and 0.9517, respectively, and the area under the curve was 0.9634. Using a gradient-based feature importance method to interpret crucial features, we identify 28 differential peptide sequences from 14 precursor proteins (e.g. hemoglobin, immunoglobulin, and albumin). This work highlights the possibility of molecular diagnosis of MDD with the aid of chemical and computer science.
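
The accuracy, sensitivity and specificity reported above are all derived from a binary confusion matrix. A minimal sketch of those definitions follows; the function name and the counts are hypothetical and are not the study's data.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion counts.

    tp/fn: patients correctly / incorrectly classified.
    tn/fp: controls correctly / incorrectly classified.
    """
    sensitivity = tp / (tp + fn)  # fraction of patients detected
    specificity = tn / (tn + fp)  # fraction of controls correctly cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts for illustration only.
sens, spec, acc = diagnostic_metrics(tp=90, fn=10, tn=80, fp=20)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.3f}")
```

Accuracy is a weighted average of sensitivity and specificity, with weights given by the proportion of patients and controls, so it always lies between the two.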

TriTan: an efficient triple nonnegative matrix factorization method for integrative analysis of single-cell multiomics data

Ma X, Lin L, Zhao Q and Iqbal M

Single-cell multiomics has opened up tremendous opportunities for understanding gene regulatory networks underlying cell states by simultaneously profiling transcriptomes, epigenomes, and proteomes of the same cell. However, existing computational methods for integrative analysis of these high-dimensional multiomics data are either computationally expensive or limited in interpretation. These limitations pose challenges in the implementation of these methods in large-scale studies and hinder a more in-depth understanding of the underlying regulatory mechanisms. Here, we propose TriTan (Triple inTegrative fast non-negative matrix factorization), an efficient joint factorization method for single-cell multiomics data. TriTan implements a highly efficient factorization algorithm, greatly improving its computational performance. The three-matrix factorization produced by TriTan helps in clustering cells, identifying signature features for each cell type, and uncovering feature associations across omics, which facilitates the identification of domains of regulatory chromatin and the prediction of cell-type-specific regulatory networks. We applied TriTan to single-cell multiomics data obtained from different technologies and benchmarked it against state-of-the-art methods, where it shows highly competitive performance. Furthermore, we showed a range of downstream analyses conducted using TriTan outputs, highlighting its capacity to facilitate interpretation in biological discovery.

BioDSNN: a dual-stream neural network with hybrid biological knowledge integration for multi-gene perturbation response prediction

Tan Y, Xie L, Yang H, Zhang Q, Luo J and Zhang Y

Studying the outcomes of genetic perturbation based on single-cell RNA-seq data is crucial for understanding genetic regulation of cells. However, the high cost of cellular experiments and single-cell sequencing restricts us from measuring the full combination space of genetic perturbations and cell types. Consequently, a number of computational models have been proposed to predict unseen combinations based on existing data. Among them, generative models, e.g. variational autoencoders and diffusion models, excel at capturing the perturbed data distribution but lack a biologically understandable foundation for generalization. On the other side of the spectrum, gene regulatory networks or gene pathway knowledge have been exploited to make generalization more reasonable. Unfortunately, these approaches do not process the two data modalities in a balanced way, leading to a degraded fitting ability. Hence, we propose a dual-stream architecture. Before the information from the two modalities is merged, the sequencing data are learned with a generative model while three types of knowledge data are comprehensively processed with graph networks and a masked transformer, enforcing a deep understanding of each single modality. The benchmark results show an approximate 20% reduction in mean squared error, demonstrating the effectiveness of the model.

A versatile pipeline to identify convergently lost ancestral conserved fragments associated with convergent evolution of vocal learning

Li X, Zhu K and Zhen Y

Molecular convergence in convergently evolved lineages provides valuable insights into the shared genetic basis of converged phenotypes. However, most methods are limited to coding regions, overlooking the potential contribution of regulatory regions. We focused on the independently evolved vocal learning ability in multiple avian lineages, and developed a whole-genome-alignment-free approach to identify genome-wide Convergently Lost Ancestral Conserved fragments (CLACs) in these lineages, encompassing noncoding regions. We discovered 2711 CLACs that are overrepresented in noncoding regions. Proximal genes of these CLACs exhibit significant enrichment in neurological pathways, including the glutamate receptor signaling and axon guidance pathways. Moreover, their expression is highly enriched in brain tissues associated with speech formation. Notably, several have known functions in speech and language learning, including the ROBO family, SLIT2, GRIN1, and GRIN2B. Additionally, we found significantly enriched motifs in noncoding CLACs, which match binding motifs of transcription factors involved in neurogenesis and gene expression regulation in the brain. Furthermore, we discovered 19 candidate genes that harbor CLACs in both human and multiple avian vocal learning lineages, suggesting their potential contribution to the independent evolution of vocal learning in both birds and humans.

Repun: an accurate small variant representation unification method for multiple sequencing platforms

Zheng Z, Ren Y, Chen L, Wong AOK, Li S, Yu X, Lam TW and Luo R

Ensuring a unified variant representation that agrees with the sequencing data is critical for downstream analysis, as variant representation may differ across platforms and sequencing conditions. Current approaches typically treat variant unification as a post-processing step following variant calling and are incapable of measuring the correct variant representation from the outset. Aligning variant representations with the alignment before variant calling has benefits such as providing reliable training labels for deep learning-based variant callers and enabling direct assessment of alignment quality. However, it also poses challenges due to the large number of candidates to handle. Here, we present Repun, a haplotype-aware variant-alignment unification algorithm that harmonizes the variant representation between provided variants and alignments in different sequencing platforms. Repun leverages phasing to facilitate equivalent haplotype matches between variants and alignments. Our approach reduces the number of comparisons between variant haplotypes and candidate haplotypes by using only haplotypes with read evidence, which speeds up the unification process. Repun achieved >99.99% precision and >99.5% recall in extensive evaluations of various Genome in a Bottle Consortium samples encompassing three sequencing platforms: Oxford Nanopore Technology, Pacific Biosciences, and Illumina. Repun is open-source and available at https://github.com/zhengzhenxian/Repun.

MCGAE: unraveling tumor invasion through integrated multimodal spatial transcriptomics

Yang Y, Zhang C, Liu Z, Aihara K, Zhang C, Chen L and Wei W

Spatially Resolved Transcriptomics (SRT) serves as a cornerstone in biomedical research, revealing the heterogeneity of tissue microenvironments. Integrating multimodal data including gene expression, spatial coordinates, and morphological information poses significant challenges for accurate spatial domain identification. Herein, we present the Multi-view Contrastive Graph Autoencoder (MCGAE), a cutting-edge deep computational framework specifically designed for the intricate analysis of spatial transcriptomics (ST) data. MCGAE advances the field by creating multi-view representations from gene expression and spatial adjacency matrices. Utilizing modular modeling, contrastive graph convolutional networks, and attention mechanisms, it generates modality-specific spatial representations and integrates them into a unified embedding. This integration process is further enriched by the inclusion of morphological image features, markedly enhancing the framework's capability to process multimodal data. Applied to both simulated and real SRT datasets, MCGAE demonstrates superior performance in spatial domain detection, data denoising, trajectory inference, and 3D feature extraction, outperforming existing methods. Specifically, in colorectal cancer liver metastases, MCGAE integrates histological and gene expression data to identify tumor invasion regions and characterize cellular molecular regulation. This breakthrough extends ST analysis and offers new tools for cancer and complex disease research.

ADM: adaptive graph diffusion for meta-dimension reduction

Feng J, Liang Y and Yu T

Dimension reduction is essential for analyzing high-dimensional data, with various techniques developed to address diverse data characteristics. However, individual methods often struggle to capture all intricate patterns and complex structures simultaneously. To overcome this limitation, we introduce ADM (Adaptive graph Diffusion for Meta-dimension reduction), a novel meta-dimension reduction method grounded in graph diffusion theory. ADM integrates results from multiple dimension reduction techniques, leveraging their individual strengths while mitigating their specific weaknesses. ADM utilizes dynamic Markov processes to transform Euclidean space results into an information space, revealing intrinsic nonlinear manifold structures that are hard to capture by conventional methods. A critical advancement in ADM is its adaptive diffusion mechanism, which dynamically selects optimal diffusion time scales for each sample, enabling effective representation of multi-scale structures. This approach generates robust, high-quality low-dimensional representations that capture both local and global data structures while reducing noise and technique-specific distortions. We demonstrate ADM's efficacy on simulated and real-world datasets, including various omics data types. Results show that ADM provides clearer separation between biological groups and reveals more meaningful patterns compared to existing methods, advancing the analysis and visualization of complex biological data.

AutoXAI4Omics: an automated explainable AI tool for omics and tabular data

Strudwick J, Gardiner LJ, Denning-James K, Haiminen N, Evans A, Kelly J, Madgwick M, Utro F, Seabolt E, Gibson C, Bedi B, Clayton D, Howell C, Parida L and Carrieri AP

Machine learning (ML) methods offer opportunities for gaining insights into the intricate workings of complex biological systems, and their applications are increasingly prominent in the analysis of omics data to facilitate tasks, such as the identification of novel biomarkers and predictive modeling of phenotypes. For scientists and domain experts, leveraging user-friendly ML pipelines can be incredibly valuable, enabling them to run sophisticated, robust, and interpretable models without requiring in-depth expertise in coding or algorithmic optimization. By streamlining the process of model development and training, researchers can devote their time and energies to the critical tasks of biological interpretation and validation, thereby maximizing the scientific impact of ML-driven insights. Here, we present an entirely automated open-source explainable AI tool, AutoXAI4Omics, that performs classification and regression tasks from omics and tabular numerical data. AutoXAI4Omics accelerates scientific discovery by automating processes and decisions made by AI experts, e.g. selection of the best feature set, hyper-tuning of different ML algorithms and selection of the best ML model for a specific task and dataset. Prior to ML analysis AutoXAI4Omics incorporates feature filtering options that are tailored to specific omic data types. Moreover, the insights into the predictions that are provided by the tool through explainability analysis highlight associations between omic feature values and the targets under investigation, e.g. predicted phenotypes, facilitating the identification of novel actionable insights. AutoXAI4Omics is available at: https://github.com/IBM/AutoXAI4Omics.

FunlncModel: integrating multi-omic features from upstream and downstream regulatory networks into a machine learning framework to identify functional lncRNAs

Li YY, Qian FC, Zhang GR, Li XC, Zhou LW, Yu ZM, Liu W, Wang QY and Li CQ

Accumulating evidence indicates that long noncoding RNAs (lncRNAs) play important roles in molecular and cellular biology. Although many algorithms have been developed to reveal their associations with complex diseases by using downstream targets, the upstream (epi)genetic regulatory information has not been sufficiently leveraged to predict the function of lncRNAs in various biological processes. Therefore, we present FunlncModel, a machine learning-based interpretable computational framework, which aims to screen out functional lncRNAs by integrating a large number of (epi)genetic features and functional genomic features from their upstream/downstream multi-omic regulatory networks. We adopted the random forest method to mine nearly 60 features in three categories from >2000 datasets across 11 data types, including transcription factors (TFs), histone modifications, typical enhancers, super-enhancers, methylation sites, and mRNAs. FunlncModel outperformed alternative methods in classification performance in human embryonic stem cells (hESCs), with an area under the receiver operating characteristic curve (AUROC) of 0.95 and an area under the precision-recall curve (AUPRC) of 0.97. It could not only infer most of the known lncRNAs that influence the states of stem cells, but also discover novel high-confidence functional lncRNAs. We extensively validated FunlncModel's efficacy on up to 27 cancer-related functional prediction tasks, which involved multiple cancer cell growth processes and cancer hallmarks. We also found that (epi)genetic regulatory features, such as TFs and histone modifications, serve as strong predictors for revealing the function of lncRNAs. Overall, FunlncModel is a strong and stable prediction model for identifying functional lncRNAs in specific cellular contexts. FunlncModel is available as a web server at https://bio.liclab.net/FunlncModel/.
