BIOINFORMATICS

APAtizer: a tool for alternative polyadenylation analysis of RNA-Seq data
Sousa B, Bessa M, de Mendonça FL, Ferreira PG, Moreira A and Pereira-Castro I
APAtizer is a tool designed to analyze alternative polyadenylation events in RNA-sequencing data. The tool handles different file formats, including BAM, htseq and DaPars bedGraph files. It provides a user-friendly interface that allows users to generate informative visualizations, including volcano plots, heatmaps and gene lists. These outputs allow the user to retrieve useful biological insights such as the occurrence of polyadenylation events when comparing two biological conditions. Additionally, it can perform differential gene expression, gene ontology analysis, visualization of Venn diagram intersections and correlation analysis.
M2ara: unraveling metabolomic drug responses in whole-cell MALDI mass spectrometry bioassays
Enzlein T, Geisel A, Hopf C and Schmidt S
Fast computational evaluation and classification of concentration responses for hundreds of metabolites represented by their mass-to-charge (m/z) ratios is indispensable for unraveling complex metabolomic drug actions in label-free, whole-cell Matrix-Assisted Laser Desorption/Ionization Mass Spectrometry (MALDI MS) bioassays. In particular, the identification of novel pharmacodynamic biomarkers to determine target engagement, potency and potential polypharmacology of drug-like compounds in high-throughput applications requires robust data interpretation pipelines. Given the large number of mass features in cell-based MALDI MS bioassays, reliable identification of true biological response patterns and their differentiation from any measurement artefacts that may be present is critical. To facilitate the exploration of metabolomic responses in complex MALDI MS datasets, we present a novel software tool, M2ara. Implemented as a user-friendly R-based shiny application, it enables rapid evaluation of Molecular High Content Screening (MHCS) assay data. Furthermore, we introduce the concept of Curve Response Score (CRS) and CRS fingerprints to enable rapid visual inspection and ranking of mass features. In addition, these CRS fingerprints allow direct comparison of cellular effects among different compounds. Beyond cellular assays, our computational framework can also be applied to MALDI MS-based (cell-free) biochemical assays in general.
DrugRepPT: a deep pre-training and fine-tuning framework for drug repositioning based on drug's expression perturbation and treatment effectiveness
Fan S, Yang K, Lu K, Dong X, Li X, Zhu Q, Li S, Zeng J and Zhou X
Drug repositioning, identifying novel indications for approved drugs, is a cost-effective strategy in drug discovery. Despite numerous proposed drug repositioning models, integrating network-based features, differential gene expression, and chemical structures for high-performance drug repositioning remains challenging.
Damsel: Analysis and visualisation of DamID sequencing in R
Page CG, Londsdale A, Mitchell KA, Schröder J, Harvey KF and Oshlack A
DamID sequencing is a technique to map the genome-wide interaction of a protein with DNA. Damsel is the first Bioconductor package to provide an end-to-end analysis for DamID sequencing data within R. Damsel performs quantification and testing of significant binding sites along with exploratory and visual analysis. Damsel produces results consistent with previous analysis approaches.
Dynamic modelling of signalling pathways when ODEs are not feasible
Rachel T, Brombacher E, Wöhrle S, Groß O and Kreutz C
Mathematical modelling plays a crucial role in understanding inter- and intracellular signalling processes. Currently, ordinary differential equations (ODEs) are the predominant approach in systems biology for modelling such pathways. While ODE models offer mechanistic interpretability, they also suffer from limitations, including the need to consider all relevant compounds, which results in large models that are difficult to handle numerically and require extensive data.
Sparse Neighbor Joining: rapid phylogenetic inference using a sparse distance matrix
Kurt S, Bouchard-Côté A and Lagergren J
Phylogenetic reconstruction is a fundamental problem in computational biology. The Neighbor Joining (NJ) algorithm offers an efficient distance-based solution to this problem, which often serves as the foundation for more advanced statistical methods. Despite prior efforts to enhance the speed of NJ, the computation of the n^2 entries of the distance matrix, where n is the number of phylogenetic tree leaves, continues to pose a limitation in scaling NJ to larger datasets.
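To make the quadratic cost concrete, here is a minimal sketch of the pair-selection step of classic (dense) NJ, not the sparse method this entry describes. The function name `nj_pick_pair` and the toy matrix `example_dist` are illustrative assumptions; the Q criterion itself is the standard NJ one.

```python
def nj_pick_pair(dist):
    """Return the index pair (i, j) that minimises the standard NJ Q criterion.

    dist is a full symmetric n x n distance matrix (list of lists), so both
    storing it and scanning it cost O(n^2), the bottleneck named in the text.
    """
    n = len(dist)
    row_sums = [sum(row) for row in dist]
    best_q, best_pair = None, None
    for i in range(n):
        for j in range(i + 1, n):
            # Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k)
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if best_q is None or q < best_q:
                best_q, best_pair = q, (i, j)
    return best_pair


# Hypothetical five-leaf distance matrix, used only for illustration.
example_dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
first_pair = nj_pick_pair(example_dist)
```

A full NJ run repeats this selection n - 2 times while shrinking the matrix, which is why avoiding the complete matrix is attractive for large n.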
MMOSurv: Meta-learning for few-shot survival analysis with multi-omics data
Wen G and Li L
High-throughput techniques have produced a large amount of high-dimensional multi-omics data, which makes it promising to predict patient survival outcomes more accurately. Recent work has shown the superiority of multi-omics data in survival analysis. However, it remains challenging to integrate multi-omics data to solve the few-shot survival prediction problem, in which only a few training samples are available, especially for rare cancers.
DeepDR: a deep learning library for drug response prediction
Jiang Z and Li P
Accurate drug response prediction is critical to advancing precision medicine and drug discovery. Recent advances in deep learning (DL) have shown promise in predicting drug response; however, the lack of convenient tools to support such modeling limits their widespread application. To address this, we introduce DeepDR, the first DL library specifically developed for drug response prediction. DeepDR simplifies the process by automating drug and cell featurization, model construction, training, and inference, all achievable with brief programming. The library incorporates three types of drug features along with nine drug encoders, four types of cell features along with nine cell encoders, and two fusion modules, enabling the implementation of up to 135 DL models for drug response prediction. We also explored benchmarking performance with DeepDR, and the optimal models are available on a user-friendly visual interface.
Pf-HaploAtlas: an interactive web app for spatiotemporal analysis of P. falciparum genes
Lee C, Ünlü ES, White NFD, Almagro-Garcia J, Ariani C and Pearson RD
Monitoring the genomic evolution of Plasmodium falciparum, the most widespread and deadliest of the human-infecting malaria species, is critical for making decisions in response to changes in drug resistance, diagnostic test failures, and vaccine effectiveness. The MalariaGEN data resources are the world's largest whole genome sequencing databases for Plasmodium parasites. The size and complexity of such data is a barrier to many potential end users in both public health and academic research. A user-friendly method for accessing and exploring data on the genetic variation of P. falciparum would greatly enable efforts in studying and controlling malaria.
Gene count estimation with pytximport enables reproducible analysis of bulk RNA sequencing data in Python
Kuehl M, Wong MN, Wanner N, Bonn S and Puelles VG
Transcript quantification tools efficiently map bulk RNA sequencing reads to reference transcriptomes. However, their output consists of transcript count estimates that are subject to multiple biases and cannot be readily used with existing differential gene expression analysis tools in Python. Here we present pytximport, a Python implementation of the tximport R package that supports a variety of input formats, different modes of bias correction, inferential replicates, gene-level summarization of transcript counts, transcript-level exports, transcript-to-gene mapping generation and optional filtering of transcripts by biotype. pytximport is part of the scverse ecosystem of open-source Python software packages for omics analyses and includes both a Python and a command-line interface. With pytximport, we propose a bulk RNA sequencing analysis workflow based on Bioconda and scverse ecosystem packages, ensuring reproducible analyses through Snakemake rules. We apply this pipeline to a publicly available RNA-sequencing dataset, demonstrating how pytximport enables the creation of Python-centric workflows capable of providing insights into transcriptomic alterations.
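As a rough illustration of what gene-level summarization of transcript counts means, the sketch below sums per-transcript estimates into per-gene totals using a transcript-to-gene map. It is plain Python and does not use or reproduce the pytximport API; the function name `summarize_to_genes`, the identifiers and the counts are hypothetical, and bias correction is deliberately left out.

```python
from collections import defaultdict


def summarize_to_genes(transcript_counts, transcript_to_gene):
    """Sum transcript-level count estimates into gene-level totals.

    transcript_counts: dict mapping transcript ID -> estimated count
    transcript_to_gene: dict mapping transcript ID -> gene ID
    Transcripts without a gene mapping are skipped in this sketch.
    """
    gene_counts = defaultdict(float)
    for transcript_id, count in transcript_counts.items():
        gene_id = transcript_to_gene.get(transcript_id)
        if gene_id is None:
            continue
        gene_counts[gene_id] += count
    return dict(gene_counts)


# Hypothetical example data, not taken from any real dataset.
example_counts = {"tx1": 10.0, "tx2": 5.5, "tx3": 2.0, "tx4": 1.0}
example_map = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
gene_level = summarize_to_genes(example_counts, example_map)
```

A plain sum is the simplest possible aggregation; the abstract indicates that the package additionally offers bias-correction modes, which this sketch does not attempt.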
Improved prediction of post-translational modification crosstalk within proteins using DeepPCT
Huang YX and Liu R
Post-translational modification (PTM) crosstalk events play critical roles in biological processes. Several machine learning methods have been developed to identify PTM crosstalk within proteins, but the accuracy is still far from satisfactory. Recent breakthroughs in deep learning and protein structure prediction could provide a potential solution to this issue.
Accurate and Transferable Drug-Target Interaction Prediction with DrugLAMP
Luo Z, Wu W, Sun Q and Wang J
Accurate prediction of drug-target interactions (DTIs), especially for novel targets or drugs, is crucial for accelerating drug discovery. Recent advances in pretrained language models (PLMs) and multi-modal learning present new opportunities to enhance DTI prediction by leveraging vast unlabeled molecular data and integrating complementary information from multiple modalities.
FastTENET: an accelerated TENET algorithm based on manycore computing in Python
Sung R, Kim H, Kim J and Lee D
TENET reconstructs gene regulatory networks from single-cell RNA sequencing (scRNAseq) data using transfer entropy, and works successfully on a variety of scRNAseq data. However, TENET is limited by its long computation time for large datasets. To address this limitation, we propose FastTENET, an array-computing version of the TENET algorithm optimized for acceleration on manycore processors such as GPUs. FastTENET counts the unique patterns of joint events to compute the transfer entropy based on array computing. Compared to TENET, FastTENET achieves up to 973× performance improvement.
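The idea of computing transfer entropy by counting patterns of joint events can be sketched as follows. This is a minimal CPU illustration under stated assumptions (already-discretised series, history length 1, plug-in probability estimates), not the FastTENET implementation; the function name `transfer_entropy` and the toy series are hypothetical.

```python
from collections import Counter
from math import log2


def transfer_entropy(x, y):
    """Plug-in estimate of TE(X -> Y) in bits for discrete series, history length 1.

    TE = sum over (a, b, c) of p(a, b, c) * log2( p(a | b, c) / p(a | b) ),
    with a = y[t+1], b = y[t], c = x[t]. All probabilities come from counts
    of the joint patterns observed in the data.
    """
    triples = list(zip(y[1:], y[:-1], x[:-1]))  # (y_next, y_now, x_now)
    n = len(triples)
    count_abc = Counter(triples)
    count_ab = Counter((a, b) for a, b, _ in triples)
    count_bc = Counter((b, c) for _, b, c in triples)
    count_b = Counter(b for _, b, _ in triples)
    te = 0.0
    for (a, b, c), k in count_abc.items():
        # p(a|b,c) / p(a|b) simplifies to k * count_b / (count_bc * count_ab)
        ratio = k * count_b[b] / (count_bc[(b, c)] * count_ab[(a, b)])
        te += (k / n) * log2(ratio)
    return te


# Toy series: y repeats x with a one-step delay, so x should inform y.
x_series = [0, 1, 1, 0, 1, 0, 0, 1]
y_series = [0] + x_series[:-1]
te_driven = transfer_entropy(x_series, y_series)
te_constant = transfer_entropy([0] * 8, y_series)
```

Because the estimate depends only on pattern counts, the same computation can be expressed as array operations, which is the property the abstract says FastTENET exploits on manycore hardware.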
Sensitivities in protein allocation models reveal distribution of metabolic capacity and flux control
van den Bogaard S, Saa PA and Alter TB
Expanding on constraint-based metabolic models, protein allocation models (PAMs) enhance flux predictions by accounting for protein resource allocation in cellular metabolism. Yet, to date, there are no dedicated methods for analyzing and understanding the growth-limiting factors in simulated phenotypes in PAMs.
Mutual information for detecting multi-class biomarkers when integrating multiple bulk or single-cell transcriptomic studies
Zou J, Li Z, Carleton N, Oesterreich S, Lee AV and Tseng GC
Biomarker detection plays a pivotal role in biomedical research. Integrating omics studies from multiple cohorts can enhance statistical power, accuracy and robustness of the detection results. However, existing methods for horizontally combining omics studies are mostly designed for two-class scenarios (e.g., cases versus controls) and are not directly applicable to studies with a multi-class design (e.g., samples from multiple disease subtypes, treatments, tissues, or cell types).
STRPsearch: fast detection of structured tandem repeat proteins
Mozaffari S, Arrías PN, Clementel D, Piovesan D, Ferrari C, Tosatto SCE and Monzon AM
Structured Tandem Repeat Proteins (STRPs) constitute a subclass of tandem repeats characterized by repetitive structural motifs. These proteins exhibit distinct secondary structures that form repetitive tertiary arrangements, often resulting in large molecular assemblies. Despite highly variable sequences, STRPs can perform important and diverse biological functions, maintaining a consistent structure with a variable number of repeat units. With the advent of protein structure prediction methods, millions of 3D models of proteins are now publicly available. However, automatic detection of STRPs remains challenging with current state-of-the-art tools due to their lack of accuracy and long execution times, hindering their application on large datasets. In most cases, manual curation remains the most accurate method for detecting and classifying STRPs, making it impracticable to annotate millions of structures.
PhosX: data-driven kinase activity inference from phosphoproteomics experiments
Lussana A, Müller-Dott S, Saez-Rodriguez J and Petsalaki E
The inference of kinase activity from phosphoproteomics data can point to causal mechanisms driving signalling processes and potential drug targets. Identifying the kinases whose change in activity explains the observed phosphorylation profiles, however, remains challenging and is constrained by the manually curated knowledge of kinase-substrate associations. Recently, experimentally determined substrate sequence specificities of human kinases have become available, but robust methods to exploit this new data for kinase activity inference are still missing. We present PhosX, a method to estimate differential kinase activity from phosphoproteomics data that combines state-of-the-art statistics in enrichment analysis with kinases' substrate sequence specificity information. Using a large phosphoproteomics dataset with known differentially regulated kinases, we show that our method identifies upregulated and downregulated kinases by relying only on the input phosphopeptides' sequences and intensity changes. We find that PhosX outperforms the currently available approach for the same task, and performs better than or similarly to state-of-the-art methods that rely on previously known kinase-substrate associations. We therefore recommend its use for data-driven kinase activity inference.
Tiberius: End-to-end deep learning with an HMM for gene prediction
Gabriel L, Becker F, Hoff KJ and Stanke M
For more than 25 years, learning-based eukaryotic gene predictors were driven by hidden Markov models (HMMs), which took a DNA sequence directly as input. Recently, Holst et al. demonstrated with their program Helixer that the accuracy of ab initio eukaryotic gene prediction can be improved by combining deep learning layers with a separate HMM postprocessor.
Micro-DeMix: a mixture beta-multinomial model for investigating the heterogeneity of the stool microbiome compositions
Liu R, Wang Y and Cheng D
Extensive research has uncovered the critical role of the human gut microbiome in various aspects of health, including metabolism, nutrition, physiology, and immune function. Fecal microbiota is often used as a proxy for understanding the gut microbiome, but it represents an aggregate view, overlooking spatial variations across different gastrointestinal (GI) locations. Emerging studies with spatial microbiome data collected from specific GI regions offer a unique opportunity to better understand the spatial composition of the stool microbiome.
Detecting transposable elements in long read genomes using sTELLeR
Bilgrav Saether K and Eisfeldt J
Repeat elements, such as transposable elements (TEs), are highly repetitive DNA sequences that compose around 50% of the genome. TEs such as Alu, SVA, HERV and L1 elements can cause disease by disrupting genes, causing frameshift mutations or altering splicing patterns. These elements are challenging to characterize using short-read genome sequencing (srGS) because of its read length and the repetitive nature of TEs. Long-read genome sequencing (lrGS) enables bridging of TEs, allowing increased resolution across repetitive DNA sequences. lrGS therefore presents an opportunity for improved TE detection and analysis, not only from a research perspective but also for future clinical detection. When choosing an lrGS TE caller, parameters such as runtime, CPU hours, sensitivity, precision and ease of inclusion in pipelines are crucial for efficient detection.
OneSC: A computational platform for recapitulating cell state transitions
Peng D and Cahan P
Computational modelling of cell state transitions has been of great interest in developmental biology, cancer biology and cell fate engineering because it enables perturbation experiments to be performed in silico more rapidly and cheaply than could be achieved in a lab. Recent advancements in single-cell RNA sequencing (scRNA-seq) allow the capture of high-resolution snapshots of cell states as they transition along temporal trajectories. Using these high-throughput datasets, we can train computational models to generate in silico 'synthetic' cells that faithfully mimic the temporal trajectories.