Micro-DeMix: a mixture beta-multinomial model for investigating the heterogeneity of the stool microbiome compositions
Extensive research has uncovered the critical role of the human gut microbiome in various aspects of health, including metabolism, nutrition, physiology, and immune function. Fecal microbiota is often used as a proxy for understanding the gut microbiome, but it represents an aggregate view, overlooking spatial variations across different gastrointestinal (GI) locations. Emerging studies with spatial microbiome data collected from specific GI regions offer a unique opportunity to better understand the spatial composition of the stool microbiome.
DrugRepPT: a deep pre-training and fine-tuning framework for drug repositioning based on drug's expression perturbation and treatment effectiveness
Drug repositioning, identifying novel indications for approved drugs, is a cost-effective strategy in drug discovery. Despite numerous proposed drug repositioning models, integrating network-based features, differential gene expression, and chemical structures for high-performance drug repositioning remains challenging.
Mutual information for detecting multi-class biomarkers when integrating multiple bulk or single-cell transcriptomic studies
Biomarker detection plays a pivotal role in biomedical research. Integrating omics studies from multiple cohorts can enhance statistical power, accuracy and robustness of the detection results. However, existing methods for horizontally combining omics studies are mostly designed for two-class scenarios (e.g., cases versus controls) and are not directly applicable for studies with multi-class design (e.g., samples from multiple disease subtypes, treatments, tissues, or cell types).
PhosX: data-driven kinase activity inference from phosphoproteomics experiments
The inference of kinase activity from phosphoproteomics data can point to causal mechanisms driving signalling processes and potential drug targets. Identifying the kinases whose change in activity explains the observed phosphorylation profiles, however, remains challenging, and constrained by the manually curated knowledge of kinase-substrate associations. Recently, experimentally determined substrate sequence specificities of human kinases have become available, but robust methods to exploit these new data for kinase activity inference are still missing. We present PhosX, a method to estimate differential kinase activity from phosphoproteomics data that combines state-of-the-art statistics in enrichment analysis with kinases' substrate sequence specificity information. Using a large phosphoproteomics dataset with known differentially regulated kinases we show that our method identifies upregulated and downregulated kinases by only relying on the input phosphopeptides' sequences and intensity changes. We find that PhosX outperforms the currently available approach for the same task, and performs better or similarly to state-of-the-art methods that rely on previously known kinase-substrate associations. We therefore recommend its use for data-driven kinase activity inference.
STRPsearch: fast detection of structured tandem repeat proteins
Structured Tandem Repeat Proteins (STRPs) constitute a subclass of tandem repeats characterized by repetitive structural motifs. These proteins exhibit distinct secondary structures that form repetitive tertiary arrangements, often resulting in large molecular assemblies. Despite highly variable sequences, STRPs can perform important and diverse biological functions, maintaining a consistent structure with a variable number of repeat units. With the advent of protein structure prediction methods, millions of 3D models of proteins are now publicly available. However, automatic detection of STRPs remains challenging with current state-of-the-art tools due to their lack of accuracy and long execution times, hindering their application on large datasets. In most cases, manual curation remains the most accurate method for detecting and classifying STRPs, making it impracticable to annotate millions of structures.
FungiFun3: Systemic gene set enrichment analysis for fungal species
The ever-growing amount of genome-wide omics data has paved the way for solving life science problems in a data-driven manner. Among others, enrichment analysis is part of the standard analysis arsenal to determine systemic signals in any given transcriptomic or proteomic data. Only a part of the members of the fungal kingdom, however, can be analyzed via public web applications, despite the global rise of fungal pathogens and their increasing resistance to antimycotics. We present FungiFun3, a major update of our user-friendly gene set enrichment web application dedicated to fungi. FungiFun3 was rebuilt from scratch to support a modern and easy-to-use web interface and supports more than four-fold more fungal strains (n = 1,287 in total) than its predecessor. In addition, it also allows ranked gene set enrichment analysis at the genomic scale. FungiFun3 thus serves as a starting hub for identifying molecular signals in omics data sets related to a vast number of available fungal strains including human fungal pathogens of the WHO's priority list and far beyond.
Improved prediction of post-translational modification crosstalk within proteins using DeepPCT
Post-translational modification (PTM) crosstalk events play critical roles in biological processes. Several machine learning methods have been developed to identify PTM crosstalk within proteins, but the accuracy is still far from satisfactory. Recent breakthroughs in deep learning and protein structure prediction could provide a potential solution to this issue.
M2ara: unraveling metabolomic drug responses in whole-cell MALDI mass spectrometry bioassays
Fast computational evaluation and classification of concentration responses for hundreds of metabolites represented by their mass-to-charge (m/z) ratios is indispensable for unraveling complex metabolomic drug actions in label-free, whole-cell Matrix-Assisted Laser Desorption/Ionization Mass Spectrometry (MALDI MS) bioassays. In particular, the identification of novel pharmacodynamic biomarkers to determine target engagement, potency and potential polypharmacology of drug-like compounds in high-throughput applications requires robust data interpretation pipelines. Given the large number of mass features in cell-based MALDI MS bioassays, reliable identification of true biological response patterns and their differentiation from any measurement artefacts that may be present is critical. To facilitate the exploration of metabolomic responses in complex MALDI MS datasets, we present a novel software tool, M2ara. Implemented as a user-friendly R-based shiny application, it enables rapid evaluation of Molecular High Content Screening (MHCS) assay data. Furthermore, we introduce the concept of Curve Response Score (CRS) and CRS fingerprints to enable rapid visual inspection and ranking of mass features. In addition, these CRS fingerprints allow direct comparison of cellular effects among different compounds. Beyond cellular assays, our computational framework can also be applied to MALDI MS-based (cell-free) biochemical assays in general.
FastTENET: an accelerated TENET algorithm based on manycore computing in Python
TENET reconstructs gene regulatory networks from single-cell RNA sequencing (scRNAseq) data using transfer entropy, and works successfully on a variety of scRNAseq data. However, TENET is limited by its long computation time for large datasets. To address this limitation, we propose FastTENET, an array-computing version of the TENET algorithm optimized for acceleration on manycore processors such as GPUs. FastTENET counts the unique patterns of joint events to compute the transfer entropy based on array computing. Compared to TENET, FastTENET achieves up to 973× performance improvement.
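The idea of computing transfer entropy by counting unique joint-event patterns can be illustrated with a minimal sketch. This is not FastTENET's implementation: it assumes discretized sequences, a history length of 1, and plain Python counters instead of GPU arrays, and the function name `transfer_entropy` is hypothetical.

```python
from collections import Counter
from math import log2


def transfer_entropy(x, y):
    """Transfer entropy X -> Y in bits for two discrete sequences.

    Assumes a history length of 1 (an illustrative choice):
    TE = sum p(y_next, y_prev, x_prev)
         * log2( p(y_next | y_prev, x_prev) / p(y_next | y_prev) )
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")

    # Joint events: (next value of Y, previous value of Y, previous value of X).
    triples = list(zip(y[1:], y[:-1], x[:-1]))
    n = len(triples)

    # Count every unique pattern once; all probabilities derive from these counts.
    c_full = Counter(triples)
    c_prev = Counter((yp, xp) for _, yp, xp in triples)
    c_pair = Counter((yn, yp) for yn, yp, _ in triples)
    c_hist = Counter(yp for _, yp, _ in triples)

    te = 0.0
    for (yn, yp, xp), count in c_full.items():
        p_joint = count / n
        p_given_both = count / c_prev[(yp, xp)]
        p_given_hist = c_pair[(yn, yp)] / c_hist[yp]
        te += p_joint * log2(p_given_both / p_given_hist)
    return te


# Usage: Y copies X with a one-step delay, so X carries information about Y's future.
x = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
y = [0] + x[:-1]
print(transfer_entropy(x, y) > 0)
```

Because the estimate depends only on pattern counts, the counting step is the natural target for array-based parallelization.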
Accurate and Transferable Drug-Target Interaction Prediction with DrugLAMP
Accurate prediction of drug-target interactions (DTIs), especially for novel targets or drugs, is crucial for accelerating drug discovery. Recent advances in pretrained language models (PLMs) and multi-modal learning present new opportunities to enhance DTI prediction by leveraging vast unlabeled molecular data and integrating complementary information from multiple modalities.
Sparse Neighbor Joining: rapid phylogenetic inference using a sparse distance matrix
Phylogenetic reconstruction is a fundamental problem in computational biology. The Neighbor Joining (NJ) algorithm offers an efficient distance-based solution to this problem, which often serves as the foundation for more advanced statistical methods. Despite prior efforts to enhance the speed of NJ, the computation of the n² entries of the distance matrix, where n is the number of phylogenetic tree leaves, continues to pose a limitation in scaling NJ to larger datasets.
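To make the role of the full distance matrix concrete, the following minimal sketch shows the pair-selection step of standard (dense) NJ, which reads every pairwise distance. It illustrates the classical algorithm only, not the sparse method proposed in the paper; the function name `nj_pick_pair` and the example matrix are illustrative assumptions.

```python
def nj_pick_pair(d):
    """Return the index pair (i, j) minimizing the classical NJ criterion.

    d is a full symmetric n x n distance matrix (list of lists), and
    Q(i, j) = (n - 2) * d[i][j] - sum_k d[i][k] - sum_k d[j][k].
    """
    n = len(d)
    row_sums = [sum(row) for row in d]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    return best[1], best[2]


# Usage: a small 5-leaf example matrix (illustrative values).
distances = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_pick_pair(distances))
```

The double loop touches all n(n-1)/2 pairs, which is why both computing and scanning the dense matrix become the bottleneck as n grows.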
OneSC: A computational platform for recapitulating cell state transitions
Computational modelling of cell state transitions has been of great interest in developmental biology, cancer biology and cell fate engineering because it enables performing perturbation experiments in silico more rapidly and cheaply than could be achieved in a lab. Recent advancements in single-cell RNA sequencing (scRNA-seq) allow the capture of high-resolution snapshots of cell states as they transition along temporal trajectories. Using these high-throughput datasets, we can train computational models to generate in silico 'synthetic' cells that faithfully mimic the temporal trajectories.
CVR-BBI: An Open-Source VR Platform for Multi-User Collaborative Brain to Brain Interfaces
As brain imaging and neurofeedback technologies advance, the brain-to-brain interface (BBI) has emerged as an innovative field, enabling in-depth exploration of cross-brain information exchange and enhancing our understanding of collaborative intelligence. However, no open-source virtual reality (VR) platform currently supports the rapid and efficient configuration of multi-user, collaborative BBIs. To address this gap, we introduce the Collaborative Virtual Reality Brain-to-Brain Interface (CVR-BBI), an open-source platform consisting of a client and server. The CVR-BBI client enables users to participate in collaborative experiments, collect electroencephalogram (EEG) data and manage interactive multisensory stimuli within the VR environment. Meanwhile, the CVR-BBI server manages multi-user collaboration paradigms, and performs real-time analysis of the EEG data. We evaluated the CVR-BBI platform using the SSVEP paradigm and observed that collaborative decoding outperformed individual decoding, validating the platform's effectiveness in collaborative settings. The CVR-BBI offers a pioneering platform that facilitates the development of innovative BBI applications within collaborative VR environments, thereby enhancing the understanding of brain collaboration and cognition.
Damsel: Analysis and visualisation of DamID sequencing in R
DamID sequencing is a technique to map the genome-wide interaction of a protein with DNA. Damsel is the first Bioconductor package to provide an end to end analysis for DamID sequencing data within R. Damsel performs quantification and testing of significant binding sites along with exploratory and visual analysis. Damsel produces results consistent with previous analysis approaches.
Pf-HaploAtlas: an interactive web app for spatiotemporal analysis of P. falciparum genes
Monitoring the genomic evolution of Plasmodium falciparum, the most widespread and deadliest of the human-infecting malaria species, is critical for making decisions in response to changes in drug resistance, diagnostic test failures, and vaccine effectiveness. The MalariaGEN data resources are the world's largest whole genome sequencing databases for Plasmodium parasites. The size and complexity of such data are a barrier to many potential end users in both public health and academic research. A user-friendly method for accessing and exploring data on the genetic variation of P. falciparum would greatly enable efforts in studying and controlling malaria.
APAtizer: a tool for alternative polyadenylation analysis of RNA-Seq data
APAtizer is a tool designed to analyze alternative polyadenylation events on RNA-sequencing data. The tool handles different file formats, including BAM, htseq and DaPars bedGraph files. It provides a user-friendly interface that allows users to generate informative visualizations, including Volcano plots, heatmaps and gene lists. These outputs allow the user to retrieve useful biological insights such as the occurrence of polyadenylation events when comparing two biological conditions. Additionally, it can perform differential gene expression, gene ontology analysis, visualization of Venn diagram intersections and correlation analysis.
Gene count estimation with pytximport enables reproducible analysis of bulk RNA sequencing data in Python
Transcript quantification tools efficiently map bulk RNA sequencing reads to reference transcriptomes. However, their output consists of transcript count estimates that are subject to multiple biases and cannot be readily used with existing differential gene expression analysis tools in Python. Here we present pytximport, a Python implementation of the tximport R package that supports a variety of input formats, different modes of bias correction, inferential replicates, gene-level summarization of transcript counts, transcript-level exports, transcript-to-gene mapping generation and optional filtering of transcripts by biotype. pytximport is part of the scverse ecosystem of open-source Python software packages for omics analyses and includes both a Python and a command-line interface. With pytximport, we propose a bulk RNA sequencing analysis workflow based on Bioconda and scverse ecosystem packages, ensuring reproducible analyses through Snakemake rules. We apply this pipeline to a publicly available RNA-sequencing dataset, demonstrating how pytximport enables the creation of Python-centric workflows capable of providing insights into transcriptomic alterations.
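The core of gene-level summarization can be sketched in a few lines. This is a deliberately simplified illustration, not the pytximport API: it only sums transcript count estimates per gene and omits bias correction, inferential replicates and biotype filtering. The function name `summarize_to_genes` and the identifiers in the example are hypothetical.

```python
from collections import defaultdict


def summarize_to_genes(transcript_counts, transcript_to_gene):
    """Sum transcript-level count estimates into gene-level counts.

    transcript_counts: dict mapping transcript ID -> estimated count
    transcript_to_gene: dict mapping transcript ID -> gene ID
    Transcripts without a gene mapping are skipped (an assumption of this sketch).
    """
    gene_counts = defaultdict(float)
    for transcript_id, count in transcript_counts.items():
        gene_id = transcript_to_gene.get(transcript_id)
        if gene_id is None:
            continue
        gene_counts[gene_id] += count
    return dict(gene_counts)


# Usage with made-up identifiers.
counts = {"tx1": 10.5, "tx2": 4.5, "tx3": 7.0, "tx4": 2.0}
mapping = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
print(summarize_to_genes(counts, mapping))
```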
Sensitivities in protein allocation models reveal distribution of metabolic capacity and flux control
Expanding on constraint-based metabolic models, protein allocation models (PAMs) enhance flux predictions by accounting for protein resource allocation in cellular metabolism. Yet, to this date, there are no dedicated methods for analyzing and understanding the growth-limiting factors in simulated phenotypes in PAMs.
MMOSurv: Meta-learning for few-shot survival analysis with multi-omics data
High-throughput techniques have produced a large amount of high-dimensional multi-omics data, which makes it promising to predict patient survival outcomes more accurately. Recent work has shown the superiority of multi-omics data in survival analysis. However, it remains challenging to integrate multi-omics data to solve the few-shot survival prediction problem, in which only a few training samples are available, especially for rare cancers.
DeepDR: a deep learning library for drug response prediction
Accurate drug response prediction is critical to advancing precision medicine and drug discovery. Recent advances in deep learning (DL) have shown promise in predicting drug response; however, the lack of convenient tools to support such modeling limits their widespread application. To address this, we introduce DeepDR, the first DL library specifically developed for drug response prediction. DeepDR simplifies the process by automating drug and cell featurization, model construction, training, and inference, all achievable with brief programming. The library incorporates three types of drug features along with nine drug encoders, four types of cell features along with nine cell encoders, and two fusion modules, enabling the implementation of up to 135 DL models for drug response prediction. We also explored benchmarking performance with DeepDR, and the optimal models are available on a user-friendly visual interface.
Expert-guided protein Language Models enable accurate and blazingly fast fitness prediction
Exhaustive experimental annotation of the effect of all known protein variants remains daunting and expensive, stressing the need for scalable effect predictions. We introduce VespaG, a blazingly fast missense amino acid variant effect predictor, leveraging protein Language Model (pLM) embeddings as input to a minimal deep learning model.