CLHGNNMDA: Hypergraph Neural Network Model Enhanced by Contrastive Learning for miRNA-Disease Association Prediction
Numerous biological experiments have demonstrated that microRNAs (miRNAs) are involved in gene regulation within cells, and that mutations and abnormal expression of miRNAs can cause a myriad of intricate diseases. Forecasting associations between miRNAs and diseases can enhance disease prevention and treatment and accelerate drug research, which holds considerable importance for the development of clinical medicine. This investigation introduces a contrastive learning-augmented hypergraph neural network model, termed CLHGNNMDA, aimed at predicting associations between miRNAs and diseases. Initially, CLHGNNMDA constructs multiple hypergraphs by leveraging diverse similarity metrics related to miRNAs and diseases. Subsequently, hypergraph convolution is applied to each hypergraph to extract feature representations for nodes and hyperedges. Following this, autoencoders are employed to reconstruct information regarding the feature representations of nodes and hyperedges and to integrate the various features of miRNAs and diseases extracted from each hypergraph. Finally, a joint contrastive loss function is utilized to refine the model and optimize its parameters. The CLHGNNMDA framework employs multi-hypergraph contrastive learning for the construction of a contrastive loss function. This approach takes into account inter-view interactions and upholds the principle of consistency, thereby augmenting the model's representational efficacy. The results obtained from fivefold cross-validation substantiate that the CLHGNNMDA algorithm achieves a mean area under the receiver operating characteristic curve of 0.9635 and a mean area under the precision-recall curve of 0.9656. These metrics are notably superior to those attained by contemporary state-of-the-art methodologies.
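The joint contrastive objective can be illustrated with a minimal two-view InfoNCE-style loss in PyTorch; the function name, temperature, and symmetric form here are illustrative assumptions, not the paper's exact loss.

```python
# Minimal sketch of a two-view InfoNCE-style contrastive loss of the kind a
# multi-hypergraph contrastive framework builds on (illustrative, not the
# authors' exact objective).
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """z1, z2: (n_nodes, dim) embeddings of the same nodes from two hypergraph views."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau              # pairwise cosine similarities across views
    targets = torch.arange(z1.size(0))      # positive pair: the same node in both views
    # symmetrize so each view serves as the anchor once
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

z1, z2 = torch.randn(128, 64), torch.randn(128, 64)
loss = info_nce(z1, z2)
```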
An Analytical Approach that Combines Knowledge from Germline and Somatic Mutations Enhances Tumor Genomic Reanalyses in Precision Oncology
Expanded analysis of tumor genomics data can deliver more benefit to current and future patients, such as improved diagnosis, prognosis, and therapeutics. Here, we report tumor genomic data from 1146 cases accompanied by simultaneous expert analysis from patients visiting our oncological clinic. We developed an analytical approach that leverages combined germline and cancer genetics knowledge to evaluate the opportunities, challenges, and yield of potentially medically relevant data. We identified 499 cases (44%) with variants of interest, defined as either potentially actionable or pathogenic in a germline setting, that had been reported in the original analysis as variants of uncertain significance (VUS). Of the 7405 total unique tumor variants reported, 462 (6.2%) were reported as VUS at the time of diagnosis, yet information from germline analyses identified them as (likely) pathogenic. Notably, we find that a sizable fraction of these variants (36%-79%) had been reported in heritable disorders and deposited in public databases before the year of tumor testing. This finding indicates the need to develop data systems that bridge current gaps in variant annotation and interpretation and to develop more complete digital representations of actionable pathways. We outline our process for achieving such methodologic integration. Sharing genomics data across medical specialties can enable more robust, equitable, and thorough use of patients' genomic data. This comprehensive analytical approach and the new knowledge derived from its results highlight its multi-specialty value in precision oncology settings.
Optimizing Metabolite Production with Neighborhood-Based Binary Quantum-Behaved Particle Swarm Optimization and Flux Balance Analysis
Metabolic engineering is a rapidly evolving field that involves optimizing microbial cell factories to overproduce various industrial products. To achieve this, several tools, leveraging constraint-based stoichiometric models and metaheuristic algorithms like particle swarm optimization (PSO), have been developed. However, PSO can potentially get trapped in local optima. Quantum-behaved PSO (QPSO) overcomes this limitation, and our study further enhances its binary version (BQPSO) with a neighborhood topology, leading to the advanced neighborhood-based BQPSO (NBQPSO). Combined with flux balance analysis (FBA), this forms an innovative approach, NBQPSO-FBA, for identifying optimal knockout strategies to maximize the desired metabolite production. Additionally, we introduced a novel encoding strategy suitable for large-scale genome-scale metabolic models (GSMMs). Evaluated on four GSMMs (iJR904, iAF1260, iJO1366, and iML1515), NBQPSO-FBA matches or surpasses established bi-level linear programming (LP) and heuristic methods in metabolite production optimization. Notably, it achieved 90.69% realization of the theoretical maximum in acetate production and demonstrated comparable performance with leading algorithms in lactate production. The efficiency of NBQPSO-FBA, which requires fewer knockouts, makes it a practical and effective tool for optimizing microbial cell factories. This addresses the rising demand for microbial products across various industries.
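The FBA fitness evaluation inside such a knockout search can be sketched with COBRApy; the model file, target exchange reaction, and scoring below are illustrative assumptions, and the swarm update itself is omitted.

```python
# Sketch of the FBA step used to score one candidate knockout set, via COBRApy.
# NBQPSO's quantum-behaved swarm logic and encoding are not reproduced; this
# only shows how a binary knockout vector can be evaluated (file name assumed).
import cobra

model = cobra.io.read_sbml_model("iJO1366.xml")  # assumed local copy of the GSMM
genes = list(model.genes)

def fba_fitness(knockout_bits, target_rxn="EX_ac_e"):
    """Score a binary vector: 1 = knock the corresponding gene out."""
    with model:  # context manager reverts all knockouts on exit
        for bit, gene in zip(knockout_bits, genes):
            if bit:
                gene.knock_out()
        sol = model.optimize()
        if sol.status != "optimal":
            return 0.0           # infeasible strain: worst fitness
        return sol.fluxes[target_rxn]  # e.g., acetate secretion flux
```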
Endhered Patterns in Matchings and RNA
An endhered pattern is a subset of arcs in a matching such that the corresponding starting points are consecutive, and the same holds for the ending points. Such patterns are in one-to-one correspondence with permutations. We focus on the occurrence frequency of such patterns in matchings and in native (real-world) RNA structures with pseudoknots. We present combinatorial results related to the distribution and asymptotic behavior of the pattern 21, which corresponds to two consecutive base pairs frequently encountered in RNA, and the pattern 12, representing the archetypal minimal pseudoknot. We show that in matchings these two patterns are equidistributed, which is quite different from what we find in native RNAs. We also examine the distribution of endhered patterns of size 3, showing how the patterns change under a certain transformation of matchings. Finally, we compute the distributions of endhered patterns of size 2 and 3 in native secondary RNA structures with pseudoknots and discuss possible outcomes of our study.
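A minimal sketch of counting the two size-2 endhered patterns in a matching, following the abstract's description; representing a matching as a list of (start, end) arcs is our assumption.

```python
# Count size-2 endhered patterns in a matching given as arcs (s, e) with s < e.
# Pattern 21 = stacked pairs (starts i, i+1; ends j, j-1), as in consecutive RNA
# base pairs; pattern 12 = minimal pseudoknot (starts i, i+1; ends j, j+1).
from itertools import combinations

def count_endhered_2(arcs):
    counts = {"12": 0, "21": 0}
    for (s1, e1), (s2, e2) in combinations(sorted(arcs), 2):
        if s2 == s1 + 1:                 # consecutive starting points
            if e2 == e1 + 1:             # ends in the same order -> pattern 12 (crossing)
                counts["12"] += 1
            elif e2 == e1 - 1:           # ends reversed -> pattern 21 (stacked)
                counts["21"] += 1
    return counts

# Matching on {1..6}: (1,6),(2,5) and (2,5),(3,4) are both stacked pairs.
print(count_endhered_2([(1, 6), (2, 5), (3, 4)]))  # {'12': 0, '21': 2}
```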
Is Tumor Growth Influenced by the Bone Remodeling Process?
In this study, we develop a comprehensive model to investigate the intricate relationship between the bone remodeling process, tumor growth, and bone diseases such as multiple myeloma. By analyzing different scenarios within the Basic Multicellular Unit, we uncover the dynamic interplay between remodeling and tumor progression. The models developed in this paper are based on the well-accepted Komarova and Ayati models for the bone remodeling process, modified to include the effects of tumor growth. Our in silico experiments yield results consistent with the existing literature, providing valuable insights into the complex dynamics at play. This research aims to improve the clinical management of bone diseases and metastasis, paving the way for targeted interventions and personalized treatment strategies that enhance the quality of life of affected individuals.
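The Komarova-type core of such models can be sketched as a power-law ODE system; all parameter values below are illustrative, and the paper's tumor-coupling terms are omitted.

```python
# Sketch of a Komarova-type bone remodeling system: osteoclasts x1, osteoblasts
# x2, bone mass z. Exponents g encode autocrine/paracrine signaling; numbers are
# illustrative placeholders, not the paper's fitted parameters.
import numpy as np
from scipy.integrate import solve_ivp

a1, a2, b1, b2 = 3.0, 4.0, 0.2, 0.02      # production/removal rates
g11, g12, g21, g22 = 1.1, 1.0, -0.5, 0.0  # signaling exponents
k1, k2 = 0.24, 0.0017                     # resorption/formation rates
x1s, x2s = 5.0, 300.0                     # assumed steady-state cell levels

def rhs(t, y):
    x1, x2, z = y
    dx1 = a1 * x1**g11 * x2**g21 - b1 * x1
    dx2 = a2 * x1**g12 * x2**g22 - b2 * x2
    # bone is resorbed/formed only by cells above their steady-state levels
    dz = -k1 * max(x1 - x1s, 0.0) + k2 * max(x2 - x2s, 0.0)
    return [dx1, dx2, dz]

sol = solve_ivp(rhs, (0, 1000), [11.0, 230.0, 100.0], dense_output=True)
```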
Advances in Estimating Level-1 Phylogenetic Networks from Unrooted SNPs
We address the problem of estimating a phylogenetic network from single-nucleotide polymorphisms (i.e., SNPs, or bi-allelic markers that have evolved under the infinite sites assumption). We focus on level-1 phylogenetic networks (i.e., networks where the cycles are node-disjoint), since more complex networks are unidentifiable. We provide a polynomial time quartet-based method that we prove correct for reconstructing the semi-directed level-1 phylogenetic network N, if we are given a set of SNPs that covers all the bipartitions of N, even if the ancestral state is not known, provided that the cycles are of length at least 5; we also prove that an algorithm developed by Dan Gusfield in 2005 correctly recovers semi-directed level-1 phylogenetic networks in polynomial time in this case. We present a stochastic model for DNA evolution, and we prove that the two methods (our quartet-based method and Gusfield's method) are statistically consistent estimators of the semi-directed level-1 phylogenetic network. For the case of multi-state homoplasy-free characters, we prove that our quartet-based method correctly constructs semi-directed level-1 networks under the required conditions (all cycles of length at least five), while Gusfield's algorithm cannot be used in that case. These results assume access to an oracle indicating which sites in the DNA alignment are homoplasy-free, and we show that the methods are robust, under some conditions, to oracle errors.
A Joint Bayesian Model for Change-Points and Heteroskedasticity Applied to the Canadian Longitudinal Study on Aging
Maintaining homeostasis, the regulation of internal physiological parameters, is essential for health and well-being. Deviations from optimal levels, or 'sweet spots,' can lead to health deterioration and disease. Identifying biomarkers with sweet spots requires both change-point detection and variance effect analysis. Traditional approaches involve separate tests for change-points and heteroskedasticity, which can yield inaccurate results if model assumptions are violated. To address these challenges, we propose a unified approach: Bayesian Testing for Heteroskedasticity and Sweet Spots (BTHS). This framework integrates sampling-based parameter estimation and Bayes factor computation to enhance change-point detection, heteroskedasticity quantification, and testing in change-point regression settings, extending previous Bayesian approaches. BTHS eliminates the need for separate analyses and provides detailed insights into both the magnitude and shape of heteroskedasticity, enabling robust identification of sweet spots without strong assumptions. We applied BTHS to blood elements from the Canadian Longitudinal Study on Aging, identifying nine blood elements with significant sweet-spot variance effects.
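The underlying comparison can be roughed out with a maximum-likelihood change-point fit and a BIC contrast between heteroskedastic and homoskedastic models; this is a crude stand-in for the paper's sampling-based Bayes factors, not the BTHS procedure itself.

```python
# Crude stand-in for joint change-point + heteroskedasticity testing: fit a
# change-point whose variance may differ on each side, and compare BIC against
# a common-variance fit. A BIC difference approximates twice the log Bayes
# factor; the paper's sampler and priors are not reproduced here.
import numpy as np

def seg_ll(seg, var):
    # Gaussian log-likelihood of a segment around its own mean
    return -0.5 * len(seg) * np.log(2 * np.pi * var) - 0.5 * np.sum((seg - seg.mean()) ** 2) / var

def cp_loglik(x, y, tau, hetero):
    left, right = y[x < tau], y[x >= tau]
    if hetero:
        return seg_ll(left, left.var() + 1e-12) + seg_ll(right, right.var() + 1e-12)
    var = y.var() + 1e-12
    return seg_ll(left, var) + seg_ll(right, var)

def bic_evidence(x, y):
    taus = np.quantile(x, np.linspace(0.1, 0.9, 50))   # candidate change-points
    ll_het = max(cp_loglik(x, y, t, True) for t in taus)
    ll_hom = max(cp_loglik(x, y, t, False) for t in taus)
    n = len(y)
    bic_het = -2 * ll_het + 5 * np.log(n)  # 2 means, 2 variances, change-point
    bic_hom = -2 * ll_hom + 4 * np.log(n)  # 2 means, 1 variance, change-point
    return bic_hom - bic_het               # > 0 favors heteroskedasticity
```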
An Earth Mover's Distance-Based Self-Supervised Framework for Cellular Dynamic Grading in Live-Cell Imaging
Cellular appearance and its dynamics frequently serve as a proxy measurement of live-cell physiological properties. The computational analysis of cell properties is considered to be a significant endeavor in biological and biomedical research. Deep learning has garnered considerable success across various fields. In light of this, various neural networks have been developed to analyze live-cell microscopic videos and capture cellular dynamics with biological significance. Specifically, cellular dynamic grading (CDG) is the task that provides a predefined dynamic grade for a live-cell according to the speed of cellular deformation and intracellular movement. This task involves recording the morphological and cytoplasmic dynamics in live-cell microscopic videos. Similar to other medical image processing tasks, CDG faces challenges in collecting and annotating cellular videos. These deficiencies in medical data limit the performance of deep learning models. In this article, we propose a novel self-supervised framework to overcome these limitations for the CDG task. Our framework relies on the assumption that increasing or decreasing cell dynamic grades is consistent with accelerating or decelerating cell appearance change in videos, respectively. This consistency is subsequently incorporated as a constraint in the loss function for the self-supervised training strategy. Our framework is implemented by formulating a probability transition matrix based on the Earth Mover's Distance and imposing a loss constraint on the elements of this matrix. Experimental results demonstrate that our proposed framework enhances the model's ability to learn spatiotemporal dynamics. Furthermore, our framework outperforms the existing methods on our cell video database.
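The basic building block, an Earth Mover's Distance between predicted distributions over ordinal dynamic grades, can be sketched as follows; the grade count and probabilities are illustrative, and the paper's transition-matrix construction is simplified away.

```python
# Sketch: EMD between two predicted distributions over ordinal dynamic grades,
# e.g., for a clip and its artificially accelerated version. On an ordinal
# support, the prediction for the faster clip should shift toward higher grades.
import numpy as np
from scipy.stats import wasserstein_distance

grades = np.arange(5)                         # 5 ordinal dynamic grades (illustrative)
p = np.array([0.70, 0.20, 0.05, 0.03, 0.02])  # prediction for the original-speed clip
q = np.array([0.10, 0.60, 0.20, 0.07, 0.03])  # prediction for the accelerated clip

emd = wasserstein_distance(grades, grades, u_weights=p, v_weights=q)
print(f"EMD between grade distributions: {emd:.3f}")
```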
DRGAT: Predicting Drug Responses Via Diffusion-Based Graph Attention Network
Accurately predicting drug response from a patient's genomic profile is critical for advancing personalized medicine. The rise of deep learning approaches, and especially of graph neural networks leveraging large-scale omics datasets, has been a key driver of research in this area. However, these biological datasets, which are typically high dimensional but have small sample sizes, present challenges such as overfitting and poor generalization in predictive models. Complicating matters further, gene expression (GE) data must capture complex inter-gene relationships, exacerbating these issues. In this article, we tackle these challenges by introducing a drug response prediction method, called drug response graph attention network (DRGAT), which combines a denoising diffusion implicit model for data augmentation with a recently introduced graph attention network (GAT) with high-order neighbor propagation (HO-GATs) prediction module. Our proposed approach achieved an almost 5% improvement in the area under the receiver operating characteristic curve compared with state-of-the-art models for many of the studied drugs, indicating our method's reasonable generalization capabilities. Moreover, our experiments confirm the potential of diffusion-based generative models, a core component of our method, to mitigate the inherent limitations of omics datasets by effectively augmenting GE data.
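A minimal GAT prediction head in PyTorch Geometric conveys the backbone such a module builds on; DRGAT's high-order neighbor propagation and diffusion-based augmentation are not reproduced here, and all dimensions are illustrative.

```python
# Minimal graph attention sketch with PyTorch Geometric: two GAT layers followed
# by a per-node binary response score. Only the generic backbone is shown.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class GATHead(torch.nn.Module):
    def __init__(self, in_dim, hid_dim=64, heads=4):
        super().__init__()
        self.g1 = GATConv(in_dim, hid_dim, heads=heads)       # concatenated heads
        self.g2 = GATConv(hid_dim * heads, hid_dim, heads=1)
        self.out = torch.nn.Linear(hid_dim, 1)

    def forward(self, x, edge_index):
        x = F.elu(self.g1(x, edge_index))
        x = F.elu(self.g2(x, edge_index))
        return torch.sigmoid(self.out(x)).squeeze(-1)          # response probability per node
```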
Generative Adversarial Networks for Neuroimage Translation
Image-to-image translation has gained popularity in the medical field to transform images from one domain to another. Medical image synthesis via domain transformation is advantageous in its ability to augment an image dataset where images for a given class are limited. From the learning perspective, this process contributes to the data-oriented robustness of the model by inherently broadening the model's exposure to more diverse visual data, enabling it to learn more generalized features. In the case of generating additional neuroimages, it is advantageous for obtaining unidentifiable medical data and augmenting smaller annotated datasets. This study proposes the development of a cycle-consistent generative adversarial network (CycleGAN) model for translating neuroimages from one field strength to another (e.g., 3 Tesla [T] to 1.5 T). This model was compared with a model based on a deep convolutional GAN architecture. CycleGAN was able to generate the synthetic and reconstructed images with reasonable accuracy. The mapping function from the source (3 T) to the target domain (1.5 T) performed optimally, with an average peak signal-to-noise ratio value of 25.69 ± 2.49 dB and a mean absolute error value of 2106.27 ± 1218.37. The code for this study has been made publicly available in a GitHub repository.
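The two reported metrics can be computed as follows; the peak value max_val depends on the image bit depth and is the caller's assumption.

```python
# The two image-quality metrics reported above, for a reference/synthetic pair.
import numpy as np

def psnr(ref: np.ndarray, syn: np.ndarray, max_val: float) -> float:
    """Peak signal-to-noise ratio in dB; max_val is the maximum possible intensity."""
    mse = np.mean((ref.astype(np.float64) - syn.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(max_val**2 / mse)

def mae(ref: np.ndarray, syn: np.ndarray) -> float:
    """Mean absolute error in raw intensity units."""
    return float(np.mean(np.abs(ref.astype(np.float64) - syn.astype(np.float64))))
```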
Drug Repurposing Using Hypergraph Embedding Based on Common Therapeutic Targets of a Drug
Developing a new drug is a long and expensive process that typically takes 10-15 years and costs billions of dollars. This has led to increasing interest in drug repositioning, which involves finding new therapeutic uses for existing drugs. Computational methods have become an increasingly important tool for identifying associations between drugs and new diseases. Graph- and hypergraph-based approaches are one family of computational methods that can be used to identify potential associations between drugs and new diseases. Here, we present a drug repurposing method based on a hypergraph neural network that predicts drug-disease associations in three stages. First, it constructs a heterogeneous graph that contains drug and disease nodes and the links between them; second, it converts this heterogeneous simple graph into a hypergraph with only disease nodes. This is achieved by grouping the diseases that use the same drug into a hyperedge, so that all diseases that are a common therapeutic target of a drug are placed on one hyperedge. Finally, a graph neural network is used to predict drug-disease associations based on the structure of the hypergraph. This model is more efficient than other methods because a hypergraph models relationships more effectively than a simple graph. Furthermore, it constructs the hypergraph using only a drug-disease association matrix, eliminating the need for extensive amounts of additional data. Experimental results show that the hypergraph-based approach effectively captures complex interrelationships between drugs and diseases, leading to improved accuracy of drug-disease association prediction compared to state-of-the-art methods.
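The hypergraph construction from the association matrix alone can be sketched in a few lines; the toy matrix is illustrative.

```python
# Sketch of the hypergraph construction: from a binary drug-disease association
# matrix A (drugs x diseases), every drug defines one hyperedge containing all
# diseases it treats, so the disease-by-hyperedge incidence matrix is A transposed.
import numpy as np

A = np.array([[1, 1, 0, 1],   # drug 0 treats diseases 0, 1, 3
              [0, 1, 1, 0],   # drug 1 treats diseases 1, 2
              [1, 0, 0, 1]])  # drug 2 treats diseases 0, 3

H = A.T                        # incidence: H[d, e] = 1 iff disease d is in hyperedge e
hyperedges = [set(np.flatnonzero(row)) for row in A]
print(hyperedges)              # [{0, 1, 3}, {1, 2}, {0, 3}]
```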
Fuzzy-Based Identification of Transition Cells to Infer Cell Trajectory for Single-Cell Transcriptomics
With the continuous evolution of single-cell RNA sequencing technology, it has become feasible to reconstruct cell development processes using computational methods. Trajectory inference is a crucial downstream analytical task that provides valuable insights into the cell cycle and differentiation. During development, cells exhibit both stable and transition states, and the latter are challenging to identify accurately. To address this challenge, we propose a novel single-cell trajectory inference method using fuzzy clustering, named scFCTI. By introducing fuzzy clustering and quantifying cell uncertainty, scFCTI can identify transition cells within unstable cell states. Moreover, scFCTI obtains a refined cell classification by characterizing different cell stages, yielding more accurate single-cell trajectory reconstructions that contain transition paths. To validate the effectiveness of scFCTI, we conduct experiments on five real datasets and four simulation datasets with different structures, comparing scFCTI with several state-of-the-art trajectory inference methods. The results demonstrate that scFCTI outperforms these methods, successfully identifying unstable cell clusters and reconstructing cell trajectories with transition states more precisely.
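The idea of flagging transition cells via fuzzy membership uncertainty can be sketched with a hand-rolled fuzzy c-means and a normalized-entropy threshold; the update rules are the standard FCM formulas, and the threshold is an illustrative assumption rather than scFCTI's actual criterion.

```python
# Sketch: fuzzy c-means memberships, then flag cells with high membership
# entropy (no dominant cluster) as candidate transition cells.
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, iters=100, seed=0):
    """X: (n_cells, n_features). Returns the (n_cells, c) membership matrix."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))             # random initial memberships
    for _ in range(iters):
        W = U**m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]       # weighted cluster centers
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
        # standard FCM update: u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
        U = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1)), axis=2)
    return U

def transition_mask(U, frac=0.9):
    ent = -np.sum(U * np.log(U + 1e-12), axis=1) / np.log(U.shape[1])
    return ent > frac          # high normalized entropy -> unstable/transition state
```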
19th International Symposium on Bioinformatics Research and Applications (ISBRA 2023)
Antibody Design with SE(3) Diffusion
We introduce an antibody variable domain diffusion model based on a general protein backbone diffusion framework, which we extended to handle multiple chains. Assessing the designability and novelty of the structures generated with our model, we find that it produces highly designable antibodies that can contain novel binding regions. The backbone dihedral angles of sampled structures show good agreement with a reference antibody distribution. We verify these designed antibodies experimentally and find that all of them express with high yield. Finally, we compare our model with a state-of-the-art generative backbone diffusion model on a range of antibody design tasks, such as the design of the complementarity-determining regions or the pairing of a light chain to an existing heavy chain, and show improved properties and designability.
A Graph-Based Machine-Learning Approach Combined with Optical Measurements to Understand Beating Dynamics of Cardiomyocytes
The development of computational models for the prediction of cardiac cellular dynamics remains a challenge due to the lack of first-principles mathematical models. We develop a novel machine-learning approach hybridizing physics simulation and graph networks to deliver robust predictions of cardiomyocyte dynamics. Embedded with inductive physical priors, the proposed constraint-based interaction neural projection (CINP) algorithm can uncover hidden physical constraints from sparse image data on a small set of beating cardiac cells and provide robust predictions for heterogeneous large-scale cell sets. We also implement an in vitro culture and imaging platform for cellular motion and calcium transient analysis to validate the model. We showcase our model's efficacy by predicting complex organoid cellular behaviors in both in silico and in vitro settings.
An R Package for Nonparametric Inference on Dynamic Populations with Infinitely Many Types
Fleming-Viot diffusions are widely used stochastic models for population dynamics that extend the celebrated Wright-Fisher diffusions. They describe the temporal evolution of the relative frequencies of the allelic types in an ideally infinite panmictic population, whose individuals undergo random genetic drift and at birth can mutate to a new allelic type drawn from a possibly infinite potential pool, independently of their parent. Recently, Bayesian nonparametric inference has been considered for this model when a finite sample of individuals is drawn from the population at several discrete time points. Previous works have fully described the relevant estimators for this problem, but current software is available only for the Wright-Fisher finite-dimensional case. Here, we provide software for the general case, overcoming some nontrivial computational challenges posed by this setting. The R package FVDDPpkg efficiently approximates the filtering and smoothing distribution for Fleming-Viot diffusions, given finite samples of individuals collected at different times. A suitable Monte Carlo approximation is also introduced in order to reduce the computational cost.
The Statistics of Parametrized Syncmers in a Simple Mutation Process Without Spurious Matches
Often, bioinformatics uses summary sketches to analyze next-generation sequencing data, but most sketches are not well understood statistically. Under a simple mutation model, Blanca et al. analyzed complete sketches, that is, the complete set of unassembled k-mers, from two closely related sequences. The analysis extracted a point mutation parameter θ quantifying the evolutionary distance between the two sequences. We extend the results of Blanca et al. for complete sketches to parametrized syncmer sketches with downsampling. A syncmer sketch can sample k-mers much more sparsely than a complete sketch. Consider the following simple mutation model disallowing insertions or deletions. Take a reference sequence (e.g., a subsequence from a reference genome), and mutate each nucleotide in it independently with probability θ to produce a mutated sequence (corresponding to, e.g., a set of reads or a draft assembly of a related genome). Then, syncmer counts alone yield an approximate Gaussian distribution for estimating θ. The assumption disallowing insertions and deletions motivates a check on the lengths of the two sequences. The syncmer count from the mutated sequence yields an approximate Gaussian distribution for its length, and a p-value can test the length of the mutated sequence against the length of the reference using syncmer counts alone. The Gaussian distributions permit syncmer counts alone to estimate θ and the mutated sequence length with a known sampling error. Under some circumstances, the results provide the sampling error for the Mash containment index when applied to syncmer counts. The approximate Gaussian distributions provide hypothesis tests and confidence intervals for phylogenetic distance and sequence length. Our methods are likely to generalize to sketches other than syncmers and may be useful in assembling reads and related applications.
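For concreteness, here is a sketch of open-syncmer selection and a Mash-style plug-in estimate of θ; the offsets, sizes, and estimator are simplifications of the parametrized scheme analyzed in the paper.

```python
# Sketch: open-syncmer selection and a plug-in point estimate of theta.
# A k-mer is kept if its lexicographically minimal s-mer starts at offset t;
# real implementations typically rank s-mers by a hash rather than lexically.
def is_open_syncmer(kmer: str, s: int, t: int) -> bool:
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    return smers.index(min(smers)) == t

def syncmers(seq: str, k: int = 15, s: int = 11, t: int = 2):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)
            if is_open_syncmer(seq[i:i + k], s, t)}

def theta_hat(ref: str, mut: str, k: int = 15) -> float:
    a, b = syncmers(ref, k), syncmers(mut, k)
    conserved = len(a & b) / max(len(a), 1)
    # an unmutated k-mer survives with probability (1 - theta)^k
    return 1 - conserved ** (1 / k)
```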
Adaptive Arithmetic Coding-Based Encoding Method Toward High-Density DNA Storage
With the rapid advancement of big data and artificial intelligence technologies, the limitations inherent in traditional storage media for accommodating vast amounts of data have become increasingly evident. DNA storage is an innovative approach harnessing DNA and other biomolecules as storage media, endowed with superior characteristics including expansive capacity, remarkable density, minimal energy requirements, and unparalleled longevity. Central to efficient DNA storage is the process of DNA coding, whereby digital information is converted into sequences of DNA bases. We introduce a novel encoding method based on adaptive arithmetic coding (AAC) that delineates the encoding process into three distinct phases: compression, error correction, and mapping. In the compression phase, Prediction by Partial Matching (PPM)-based AAC compresses the data and enhances storage density. Subsequently, the error correction phase relies on an octal Hamming code to rectify errors and safeguard data integrity. The mapping phase employs a "3-2 code" mapping relationship to ensure adherence to biochemical constraints. The proposed method was verified by encoding files of different formats, such as text, pictures, and audio. The results indicate that the average coding density can reach 3.25 bits per nucleotide, the GC content (the fraction of guanine [G] and cytosine [C] bases) can be stabilized at 50%, and the homopolymer length is restricted to no more than 2. Simulation results corroborate the method's efficacy in preserving data integrity during both reading and writing operations, augmenting storage density, and exhibiting robust error correction capabilities.
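A toy stand-in for the mapping phase shows the flavor of a "3 bits → 2 nucleotides" table; this hypothetical table is not the paper's actual "3-2 code", though it does avoid within-pair homopolymers and fixes GC content at exactly 50%. Cross-boundary homopolymer control, as in the paper, would require context-dependent rules, and the reported 3.25 bits/nt density comes from the compression phase, not from the mapping alone.

```python
# Hypothetical "3 bits -> 2 nucleotides" table (NOT the paper's table): each
# dinucleotide has two distinct bases and exactly one G/C, so any encoded
# message has 50% GC content and no homopolymer within a pair. Junctions
# between pairs are not constrained here, unlike the real scheme.
CODE = {"000": "AC", "001": "AG", "010": "CA", "011": "CT",
        "100": "GA", "101": "GT", "110": "TC", "111": "TG"}

def encode(bits: str) -> str:
    assert len(bits) % 3 == 0, "pad the bitstream to a multiple of 3"
    return "".join(CODE[bits[i:i + 3]] for i in range(0, len(bits), 3))

print(encode("000101110"))  # -> "ACGTTC"
```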
Attention-Guided Residual U-Net with SE Connection and ASPP for Watershed-Based Cell Segmentation in Microscopy Images
Time-lapse microscopy imaging is a crucial technique in biomedical studies for observing cellular behavior over time, providing essential data on cell numbers, sizes, shapes, and interactions. Manual analysis of hundreds or thousands of cells is impractical, necessitating the development of automated cell segmentation approaches. Traditional image processing methods have made significant progress in this area, but the advent of deep learning methods, particularly those using U-Net-based networks, has further enhanced performance in medical and microscopy image segmentation. However, challenges remain, particularly in accurately segmenting touching cells in images with low signal-to-noise ratios. Existing methods often struggle to effectively integrate features across different levels of abstraction. This can lead to model confusion, particularly when important contextual information is lost or the features are not adequately distinguished. The challenge lies in appropriately combining these features to preserve critical details while ensuring robust and accurate segmentation. To address these issues, we propose a novel framework called RA-SE-ASPP-Net, which incorporates Residual Blocks, an Attention Mechanism, Squeeze-and-Excitation connections, and Atrous Spatial Pyramid Pooling to achieve precise and robust cell segmentation. We evaluate our proposed architecture using an induced pluripotent stem cell reprogramming dataset, a challenging dataset that has received limited attention in this field. Additionally, we conduct ablation experiments to demonstrate the robustness of our model. The proposed architecture outperforms the baseline models on all evaluated metrics, providing the most accurate semantic segmentation results. Finally, we apply the watershed method to the semantic segmentation results to obtain precise instance segmentations with specific information for each cell.
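The final watershed step follows the standard distance-transform recipe; this scikit-image sketch is a simplified version of the post-processing described above, with an illustrative min_distance.

```python
# Sketch of the watershed post-processing: split touching cells in a binary
# semantic segmentation using the distance transform as relief. Standard
# scikit-image recipe; seed spacing (min_distance) is an illustrative choice.
import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def split_touching_cells(mask: np.ndarray) -> np.ndarray:
    """mask: binary semantic segmentation; returns an instance label image."""
    distance = ndi.distance_transform_edt(mask)
    coords = peak_local_max(distance, min_distance=10, labels=mask)  # one seed per cell
    markers = np.zeros(mask.shape, dtype=int)
    markers[tuple(coords.T)] = np.arange(1, len(coords) + 1)
    return watershed(-distance, markers, mask=mask)
```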
Generative AI Models for the Protein Scaffold Filling Problem
De novo protein sequencing is an important problem in proteomics, playing a crucial role in understanding protein function, drug discovery and design, evolutionary studies, and more. Top-down and bottom-up tandem mass spectrometry are popular approaches used in the field of mass spectrometry to analyze and sequence proteins. However, these approaches often produce incomplete protein sequences with gaps, namely scaffolds. The protein scaffold filling problem refers to filling in the missing amino acids in the gaps of a scaffold to infer the complete protein sequence. In this article, we tackle the protein scaffold filling problem with generative AI techniques, such as convolutional denoising autoencoder, transformer, and generative pretrained transformer (GPT) models, to complete protein sequences, and we compare our results with a recently developed convolutional long short-term memory-based sequence model. We evaluate model performance on both a real dataset and generated datasets. All proposed models show strong prediction accuracy. Notably, the GPT-2 model achieves 100% gap-filling accuracy and 100% full-sequence accuracy on the MabCampath protein scaffold, outperforming the other models.
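Framing a scaffold-filling instance for a generative sequence model can be sketched as follows; the gap token and the toy sequences are illustrative assumptions about the data format, not the paper's exact tokenization.

```python
# Sketch of one training example for scaffold filling: the input is the scaffold
# with gaps marked by a sentinel token, and the target is the complete sequence.
GAP = "<gap>"

def make_example(scaffold_parts, full_sequence):
    """scaffold_parts: contiguous fragments recovered by MS, in order; gaps lie between them."""
    src = GAP.join(scaffold_parts)
    return {"input": src, "target": full_sequence}

ex = make_example(["MKTAY", "GKLVI", "WDQ"], "MKTAYIAKQRGKLVIEGNWDQ")
print(ex["input"])   # MKTAY<gap>GKLVI<gap>WDQ
```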
Correcting for Observation Bias in Cancer Progression Modeling
Tumor progression is driven by the accumulation of genetic alterations, including both point mutations and copy number changes. Understanding the temporal sequence of these events is crucial for comprehending the disease but is not directly discernible from cross-sectional genomic data. Cancer progression models, including Mutual Hazard Networks (MHNs), aim to reconstruct the dynamics of tumor progression by learning the causal interactions between genetic events based on their co-occurrence patterns in cross-sectional data. Here, we highlight a commonly overlooked bias in cross-sectional datasets that can distort progression modeling. Tumors become clinically detectable when they cause symptoms or are identified through imaging or tests. Detection factors, such as size, inflammation (fever, fatigue), and elevated biochemical markers, are influenced by genomic alterations. Ignoring these effects leads to "conditioning on a collider" bias, where events making the tumor more observable appear anticorrelated, creating false suppressive effects or masking promoting effects among genetic events. We enhance MHNs by incorporating the effects of genetic progression events on the inclusion of a tumor in a dataset, thus correcting for collider bias. We derive an efficient tensor formula for the likelihood function and apply it to two datasets from the MSK-IMPACT study. In colon adenocarcinoma, we observe a significantly higher rate of clinical detection for TP53-positive tumors, while in lung adenocarcinoma, the same is true for EGFR-positive tumors. Compared to classical MHNs, this approach eliminates several spurious suppressive interactions and uncovers multiple promoting effects.
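The collider effect itself is easy to reproduce in simulation: two independently occurring events that both raise detection probability look anticorrelated among detected tumors (all rates here are illustrative).

```python
# Sketch of "conditioning on a collider": events A and B arise independently,
# both increase the chance a tumor is clinically detected, and restricting to
# detected tumors induces a spurious negative correlation between them.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
a = rng.random(n) < 0.3                        # event A, independent of B
b = rng.random(n) < 0.3                        # event B
p_detect = 0.05 + 0.45 * a + 0.45 * b          # both events increase detectability
observed = rng.random(n) < p_detect            # inclusion in the cross-sectional dataset

print(np.corrcoef(a, b)[0, 1])                       # ~0: independent in the full population
print(np.corrcoef(a[observed], b[observed])[0, 1])   # negative: collider bias
```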