BIOINFORMATICS

Tiberius: End-to-end deep learning with an HMM for gene prediction
Gabriel L, Becker F, Hoff KJ and Stanke M
For more than 25 years, learning-based eukaryotic gene predictors were driven by hidden Markov models (HMMs), which took a DNA sequence directly as input. Recently, Holst et al. demonstrated with their program Helixer that the accuracy of ab initio eukaryotic gene prediction can be improved by combining deep learning layers with a separate HMM postprocessor.
Dynamic modelling of signalling pathways when ODEs are not feasible
Rachel T, Brombacher E, Wöhrle S, Groß O and Kreutz C
Mathematical modelling plays a crucial role in understanding inter- and intracellular signalling processes. Currently, ordinary differential equations (ODEs) are the predominant approach in systems biology for modelling such pathways. While ODE models offer mechanistic interpretability, they also suffer from limitations, including the need to consider all relevant compounds, which results in large models that are difficult to handle numerically and require extensive data.
STRPsearch: fast detection of structured tandem repeat proteins
Mozaffari S, Arrías PN, Clementel D, Piovesan D, Ferrari C, Tosatto SCE and Monzon AM
Structured Tandem Repeat Proteins (STRPs) constitute a subclass of tandem repeats characterized by repetitive structural motifs. These proteins exhibit distinct secondary structures that form repetitive tertiary arrangements, often resulting in large molecular assemblies. Despite highly variable sequences, STRPs can perform important and diverse biological functions, maintaining a consistent structure with a variable number of repeat units. With the advent of protein structure prediction methods, millions of 3D models of proteins are now publicly available. However, automatic detection of STRPs remains challenging with current state-of-the-art tools due to their lack of accuracy and long execution times, hindering their application to large datasets. In most cases, manual curation remains the most accurate method for detecting and classifying STRPs, which makes annotating millions of structures impracticable.
DeepDR: a deep learning library for drug response prediction
Jiang Z and Li P
Accurate drug response prediction is critical to advancing precision medicine and drug discovery. Recent advances in deep learning (DL) have shown promise in predicting drug response; however, the lack of convenient tools to support such modeling limits their widespread application. To address this, we introduce DeepDR, the first DL library specifically developed for drug response prediction. DeepDR simplifies the process by automating drug and cell featurization, model construction, training, and inference, all achievable with brief programming. The library incorporates three types of drug features along with nine drug encoders, four types of cell features along with nine cell encoders, and two fusion modules, enabling the implementation of up to 135 DL models for drug response prediction. We also explored benchmarking performance with DeepDR, and the optimal models are available on a user-friendly visual interface.
Facilitating phenotyping from clinical texts: the medkit library
Neuraz A, Vaillant G, Arias C, Birot O, Huynh KT, Fabacher T, Rogier A, Garcelon N, Lerner I, Rance B and Coulet A
Phenotyping consists of applying algorithms to identify individuals associated with a specific, potentially complex, trait or condition, typically from a collection of Electronic Health Records (EHRs). Because much of the clinical information in EHRs lies in text, phenotyping from text plays an important role in studies that rely on the secondary use of EHRs. However, the heterogeneity and highly specialized nature of both the content and form of clinical texts make this task particularly tedious, and are a source of time and cost constraints in observational studies.
Gene count estimation with pytximport enables reproducible analysis of bulk RNA sequencing data in Python
Kuehl M, Wong MN, Wanner N, Bonn S and Puelles VG
Transcript quantification tools efficiently map bulk RNA sequencing reads to reference transcriptomes. However, their output consists of transcript count estimates that are subject to multiple biases and cannot be readily used with existing differential gene expression analysis tools in Python. Here we present pytximport, a Python implementation of the tximport R package that supports a variety of input formats, different modes of bias correction, inferential replicates, gene-level summarization of transcript counts, transcript-level exports, transcript-to-gene mapping generation and optional filtering of transcripts by biotype. pytximport is part of the scverse ecosystem of open-source Python software packages for omics analyses and includes both a Python and a command-line interface. With pytximport, we propose a bulk RNA sequencing analysis workflow based on Bioconda and scverse ecosystem packages, ensuring reproducible analyses through Snakemake rules. We apply this pipeline to a publicly available RNA-sequencing dataset, demonstrating how pytximport enables the creation of Python-centric workflows capable of providing insights into transcriptomic alterations.
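As a rough illustration of what gene-level summarization of transcript counts involves, the sketch below sums transcript-level count estimates into per-gene totals using a transcript-to-gene map. It is plain Python and does not use the pytximport API; the function name `summarize_to_gene`, the identifiers, and the counts are hypothetical, and bias correction is deliberately left out.

```python
from collections import defaultdict


def summarize_to_gene(transcript_counts, transcript_to_gene):
    """Sum transcript-level count estimates per gene.

    Transcripts missing from the mapping are skipped and returned
    separately so the caller can decide how to handle them.
    """
    gene_counts = defaultdict(float)
    unmapped = []
    for transcript_id, count in transcript_counts.items():
        gene_id = transcript_to_gene.get(transcript_id)
        if gene_id is None:
            unmapped.append(transcript_id)
            continue
        gene_counts[gene_id] += count
    return dict(gene_counts), unmapped


# Hypothetical example data (not from any real quantification run).
counts = {"tx1": 10.5, "tx2": 4.5, "tx3": 7.0, "tx4": 1.0}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_counts, unmapped = summarize_to_gene(counts, tx2gene)
print(gene_counts)
print(unmapped)
```

Returning the unmapped transcripts rather than silently dropping them is a design choice of this sketch; it makes gaps in the transcript-to-gene map visible.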
Damsel: Analysis and visualisation of DamID sequencing in R
Page CG, Londsdale A, Mitchell KA, Schröder J, Harvey KF and Oshlack A
DamID sequencing is a technique to map the genome-wide interaction of a protein with DNA. Damsel is the first Bioconductor package to provide an end-to-end analysis for DamID sequencing data within R. Damsel performs quantification and testing of significant binding sites along with exploratory and visual analysis. Damsel produces results consistent with previous analysis approaches.
R3DMCS: a web server for visualizing structural variation in RNA motifs across experimental 3D structures from the same organism or across species
Appasamy SD and Zirbel CL
The recent progress in RNA structure determination methods has resulted in a surge of newly solved RNA 3D structures. However, there is an absence of a user-friendly browser-based tool that can facilitate the comparison and visualization of RNA motifs across multiple 3D structures.
Micro-DeMix: a mixture beta-multinomial model for investigating the heterogeneity of the stool microbiome compositions
Liu R, Wang Y and Cheng D
Extensive research has uncovered the critical role of the human gut microbiome in various aspects of health, including metabolism, nutrition, physiology, and immune function. Fecal microbiota is often used as a proxy for understanding the gut microbiome, but it represents an aggregate view, overlooking spatial variations across different gastrointestinal (GI) locations. Emerging studies with spatial microbiome data collected from specific GI regions offer a unique opportunity to better understand the spatial composition of the stool microbiome.
DrugRepPT: a deep pre-training and fine-tuning framework for drug repositioning based on drug's expression perturbation and treatment effectiveness
Fan S, Yang K, Lu K, Dong X, Li X, Zhu Q, Li S, Zeng J and Zhou X
Drug repositioning, identifying novel indications for approved drugs, is a cost-effective strategy in drug discovery. Despite numerous proposed drug repositioning models, integrating network-based features, differential gene expression, and chemical structures for high-performance drug repositioning remains challenging.
PhosX: data-driven kinase activity inference from phosphoproteomics experiments
Lussana A, Müller-Dott S, Saez-Rodriguez J and Petsalaki E
The inference of kinase activity from phosphoproteomics data can point to causal mechanisms driving signalling processes and potential drug targets. Identifying the kinases whose change in activity explains the observed phosphorylation profiles, however, remains challenging and is constrained by the manually curated knowledge of kinase-substrate associations. Recently, experimentally determined substrate sequence specificities of human kinases have become available, but robust methods to exploit these new data for kinase activity inference are still missing. We present PhosX, a method to estimate differential kinase activity from phosphoproteomics data that combines state-of-the-art statistics in enrichment analysis with kinases' substrate sequence specificity information. Using a large phosphoproteomics dataset with known differentially regulated kinases, we show that our method identifies upregulated and downregulated kinases by relying only on the input phosphopeptides' sequences and intensity changes. We find that PhosX outperforms the currently available approach for the same task, and performs better than or similarly to state-of-the-art methods that rely on previously known kinase-substrate associations. We therefore recommend its use for data-driven kinase activity inference.
Mutual information for detecting multi-class biomarkers when integrating multiple bulk or single-cell transcriptomic studies
Zou J, Li Z, Carleton N, Oesterreich S, Lee AV and Tseng GC
Biomarker detection plays a pivotal role in biomedical research. Integrating omics studies from multiple cohorts can enhance statistical power, accuracy and robustness of the detection results. However, existing methods for horizontally combining omics studies are mostly designed for two-class scenarios (e.g., cases versus controls) and are not directly applicable for studies with multi-class design (e.g., samples from multiple disease subtypes, treatments, tissues, or cell types).
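To make the multi-class setting concrete, the sketch below computes the mutual information between a discretized expression profile and a class label with more than two classes. It is a textbook plug-in estimate for discrete variables, not the authors' method; the function name, the gene profiles, and the class labels are hypothetical, and continuous expression values would first need discretization or a different estimator.

```python
import math
from collections import Counter


def mutual_information(values, labels):
    """Mutual information (in nats) between two discrete sequences."""
    if len(values) != len(labels):
        raise ValueError("values and labels must have the same length")
    n = len(values)
    joint = Counter(zip(values, labels))
    value_counts = Counter(values)
    label_counts = Counter(labels)
    mi = 0.0
    for (v, c), n_vc in joint.items():
        # p(v, c) * log(p(v, c) / (p(v) * p(c))), written with raw counts.
        mi += (n_vc / n) * math.log(n_vc * n / (value_counts[v] * label_counts[c]))
    return mi


# Hypothetical discretized expression levels for two genes across six
# samples from three classes (for example, three disease subtypes).
labels = ["A", "A", "B", "B", "C", "C"]
informative_gene = ["low", "low", "mid", "mid", "high", "high"]
uninformative_gene = ["low", "high", "low", "high", "low", "high"]

mi_informative = mutual_information(informative_gene, labels)
mi_uninformative = mutual_information(uninformative_gene, labels)
print(mi_informative)
print(mi_uninformative)
```

Because mutual information makes no assumption about the number of classes, the same function applies unchanged to two-class and multi-class designs.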
MMOSurv: Meta-learning for few-shot survival analysis with multi-omics data
Wen G and Li L
High-throughput techniques have produced a large amount of high-dimensional multi-omics data, which makes it promising to predict patient survival outcomes more accurately. Recent work has shown the superiority of multi-omics data in survival analysis. However, it remains challenging to integrate multi-omics data to solve the few-shot survival prediction problem, in which only a few training samples are available, especially for rare cancers.
Detecting transposable elements in long read genomes using sTELLeR
Bilgrav Saether K and Eisfeldt J
Repeat elements, such as transposable elements (TEs), are highly repetitive DNA sequences that compose around 50% of the genome. TEs such as Alu, SVA, HERV and L1 elements can cause disease by disrupting genes, causing frameshift mutations or altering splicing patterns. These elements are challenging to characterize using short-read genome sequencing (srGS), due to its read length and the repetitive nature of TEs. Long-read genome sequencing (lrGS) enables bridging of TEs, allowing increased resolution across repetitive DNA sequences. lrGS therefore presents an opportunity for improved TE detection and analysis, not only from a research perspective, but also for future clinical detection. When choosing an lrGS TE caller, parameters such as runtime, CPU hours, sensitivity, precision and compatibility with inclusion into pipelines are crucial for efficient detection.
APAtizer: a tool for alternative polyadenylation analysis of RNA-Seq data
Sousa B, Bessa M, de Mendonça FL, Ferreira PG, Moreira A and Pereira-Castro I
APAtizer is a tool designed to analyze alternative polyadenylation events on RNA-sequencing data. The tool handles different file formats, including BAM, htseq and DaPars bedGraph files. It provides a user-friendly interface that allows users to generate informative visualizations, including Volcano plots, heatmaps and gene lists. These outputs allow the user to retrieve useful biological insights such as the occurrence of polyadenylation events when comparing two biological conditions. Additionally, it can perform differential gene expression, gene ontology analysis, visualization of Venn diagram intersections and correlation analysis.
Optimizing Multi-Omics Data Imputation with NMF and GAN Synergy
Ansari MI, Ahmed KT and Zhang W
Integrating multiple omics datasets can significantly advance our understanding of disease mechanisms, physiology, and treatment responses. However, a major challenge in multi-omics studies is the disparity in sample sizes across different datasets, which can introduce bias and reduce statistical power. To address this issue, we propose a novel framework, OmicsNMF, designed to impute missing omics data and enhance disease phenotype prediction. OmicsNMF integrates Generative Adversarial Networks (GANs) with Non-Negative Matrix Factorization (NMF). NMF is a well-established method for uncovering underlying patterns in omics data, while GANs enhance the imputation process by generating realistic data samples. This synergy aims to more effectively address sample size disparity, thereby improving data integration and prediction accuracy.
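For readers unfamiliar with NMF, the sketch below factors a small non-negative matrix X into non-negative factors W and H using the standard multiplicative update rules for the squared Frobenius error. It is plain NMF in pure Python, not the OmicsNMF framework: there is no GAN component and no handling of missing samples, and the toy matrix, function names, and parameter defaults are assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_error(x, w, h):
    """Squared Frobenius norm of X - WH."""
    wh = matmul(w, h)
    return sum((xi - yi) ** 2 for xr, yr in zip(x, wh) for xi, yi in zip(xr, yr))


def nmf(x, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative X (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    initial_error = frobenius_error(x, w, h)
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), element-wise.
        wt = transpose(w)
        num = matmul(wt, x)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (X H^T) / (W H H^T), element-wise.
        ht = transpose(h)
        num = matmul(x, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h, initial_error


# Hypothetical toy matrix (samples x features) with an exact rank-2 structure.
x = [
    [1.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 6.0, 2.0],
]
w, h, initial_error = nmf(x, rank=2)
final_error = frobenius_error(x, w, h)
print(initial_error, final_error)
```

The multiplicative form keeps every entry of W and H non-negative without an explicit projection step, which is why it is a common starting point for NMF on count-like omics data.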
M2ara: unraveling metabolomic drug responses in whole-cell MALDI mass spectrometry bioassays
Enzlein T, Geisel A, Hopf C and Schmidt S
Fast computational evaluation and classification of concentration responses for hundreds of metabolites represented by their mass-to-charge (m/z) ratios is indispensable for unraveling complex metabolomic drug actions in label-free, whole-cell Matrix-Assisted Laser Desorption/Ionization Mass Spectrometry (MALDI MS) bioassays. In particular, the identification of novel pharmacodynamic biomarkers to determine target engagement, potency and potential polypharmacology of drug-like compounds in high-throughput applications requires robust data interpretation pipelines. Given the large number of mass features in cell-based MALDI MS bioassays, reliable identification of true biological response patterns and their differentiation from any measurement artefacts that may be present is critical. To facilitate the exploration of metabolomic responses in complex MALDI MS datasets, we present a novel software tool, M2ara. Implemented as a user-friendly R-based shiny application, it enables rapid evaluation of Molecular High Content Screening (MHCS) assay data. Furthermore, we introduce the concept of Curve Response Score (CRS) and CRS fingerprints to enable rapid visual inspection and ranking of mass features. In addition, these CRS fingerprints allow direct comparison of cellular effects among different compounds. Beyond cellular assays, our computational framework can also be applied to MALDI MS-based (cell-free) biochemical assays in general.
A signal-diffusion-based unsupervised contrastive representation learning for spatial transcriptomics analysis
Chen N, Yu X, Li W, Liu F, Luo Y and Zuo Z
Spatial transcriptomics allows for the measurement of high-throughput gene expression data while preserving the spatial structure of tissues and histological images. Integrating gene expression, spatial information, and image data to learn discriminative low-dimensional representations is critical for dissecting tissue heterogeneity and analyzing biological functions. However, most existing methods have limitations in effectively utilizing spatial information and high-resolution histological images. We propose a signal-diffusion-based unsupervised contrastive learning method (SDUCL) for learning low-dimensional latent embeddings of cells/spots.
Sensitivities in protein allocation models reveal distribution of metabolic capacity and flux control
van den Bogaard S, Saa PA and Alter TB
Expanding on constraint-based metabolic models, protein allocation models (PAMs) enhance flux predictions by accounting for protein resource allocation in cellular metabolism. Yet, to date, there are no dedicated methods for analyzing and understanding the growth-limiting factors in phenotypes simulated with PAMs.
LmRaC: a functionally extensible tool for LLM interrogation of user experimental results
Craig DB and Drăghici S
Large Language Models (LLMs) have provided spectacular results across a wide variety of domains. However, persistent concerns about hallucination and fabrication of authoritative sources raise serious issues for their integral use in scientific research. Retrieval-augmented generation (RAG) is a technique for making data and documents, otherwise unavailable during training, available to the LLM for reasoning tasks. In addition to making dynamic and quantitative data available to the LLM, RAG provides the means by which to carefully control and trace source material, thereby ensuring results are accurate, complete and authoritative.
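The retrieval step of RAG, and the way it keeps sources traceable, can be sketched as follows. This is a deliberately simplified illustration and not LmRaC's implementation: documents are ranked by plain word overlap rather than by embeddings, no LLM is called, and the document store, ids, and function names are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, top_k=2):
    """Rank documents by word overlap with the question; return the top_k ids."""
    q_tokens = tokenize(question)
    scored = []
    for doc_id, text in documents.items():
        overlap = len(q_tokens & tokenize(text))
        if overlap > 0:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken by document id for reproducibility.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


def build_prompt(question, documents, doc_ids):
    """Assemble a prompt that labels each retrieved source with its id."""
    context = "\n".join(f"[{doc_id}] {documents[doc_id]}" for doc_id in doc_ids)
    return (
        "Answer using only the sources below and cite their ids.\n"
        f"{context}\n"
        f"Question: {question}"
    )


# Hypothetical document store; ids and text are invented for illustration.
documents = {
    "doc1": "Gene X is upregulated in the treated samples.",
    "doc2": "The pathway for glucose metabolism includes gene Y.",
    "doc3": "Sample collection followed the standard protocol.",
}
question = "Which gene is upregulated in treated samples?"
doc_ids = retrieve(question, documents)
prompt = build_prompt(question, documents, doc_ids)
print(prompt)
```

Carrying the document id alongside each retrieved passage is what allows an answer to be traced back to its source material.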
Pf-HaploAtlas: an interactive web app for spatiotemporal analysis of P. falciparum genes
Lee C, Ünlü ES, White NFD, Almagro-Garcia J, Ariani C and Pearson RD
Monitoring the genomic evolution of Plasmodium falciparum, the most widespread and deadliest of the human-infecting malaria species, is critical for making decisions in response to changes in drug resistance, diagnostic test failures, and vaccine effectiveness. The MalariaGEN data resources are the world's largest whole genome sequencing databases for Plasmodium parasites. The size and complexity of such data are a barrier to many potential end users in both public health and academic research. A user-friendly method for accessing and exploring data on the genetic variation of P. falciparum would greatly aid efforts in studying and controlling malaria.