CaClust: linking genotype to transcriptional heterogeneity of follicular lymphoma using BCR and exomic variants
Tumours exhibit high genotypic and transcriptional heterogeneity. Both affect cancer progression and treatment, but have been predominantly studied separately in follicular lymphoma. To comprehensively investigate the evolution and genotype-to-phenotype maps in follicular lymphoma, we introduce CaClust, a probabilistic graphical model integrating deep whole exome, single-cell RNA and B-cell receptor sequencing data to infer clone genotypes, cell-to-clone mapping, and single-cell genotyping. CaClust outperforms a state-of-the-art model on simulated and patient data. In-depth analyses of single cells from four samples showcase effects of driver mutations, follicular lymphoma evolution, possible therapeutic targets, and single-cell genotyping that agrees with an independent targeted resequencing experiment.
scDOT: optimal transport for mapping senescent cells in spatial transcriptomics
The low resolution of spatial transcriptomics data necessitates additional information for optimal use. We developed scDOT, which combines spatial transcriptomics and single cell RNA sequencing to improve the ability to reconstruct single cell resolved spatial maps and identify senescent cells. scDOT integrates optimal transport and expression deconvolution to learn non-linear couplings between cells and spots and to infer cell placements. Application of scDOT to lung spatial transcriptomics data improves on prior methods and allows the identification of the spatial organization of senescent cells, their neighboring cells and novel genes involved in cell-cell interactions that may be driving senescence.
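As a rough illustration of the optimal transport idea mentioned above, the sketch below computes an entropy-regularized coupling between a few cells and spots using Sinkhorn iterations. It is a generic example and not scDOT's implementation: the function name `sinkhorn_coupling`, the squared Euclidean cost, the uniform marginals, and the toy expression values are all assumptions made for illustration.

```python
import numpy as np

def sinkhorn_coupling(cost, a, b, eps=0.1, n_iter=500):
    """Entropy-regularized optimal transport via Sinkhorn scaling.

    cost: (n_cells, n_spots) cost matrix; a, b: marginal masses that each sum to 1.
    Returns a coupling whose rows sum (approximately) to a and columns to b.
    """
    K = np.exp(-cost / eps)            # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)                # rescale rows toward a
        v = b / (K.T @ u)              # rescale columns toward b
    return u[:, None] * K * v[None, :]

# Hypothetical toy data: 4 cells and 2 spots measured on 3 genes.
cells = np.array([[1.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0],
                  [0.0, 1.0, 0.2],
                  [0.0, 0.8, 0.3]])
spots = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.2]])

# Squared Euclidean distance between every cell and every spot (an assumed cost).
cost = ((cells[:, None, :] - spots[None, :, :]) ** 2).sum(axis=2)
a = np.full(4, 0.25)                   # uniform mass over cells
b = np.full(2, 0.5)                    # uniform mass over spots

coupling = sinkhorn_coupling(cost, a, b)
assignment = coupling.argmax(axis=1)   # most likely spot for each cell
```

Each row of `coupling` can be read as a soft placement of one cell across spots; taking the row-wise maximum gives a hard placement.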
GraphPCA: a fast and interpretable dimension reduction algorithm for spatial transcriptomics data
The rapid advancement of spatial transcriptomics technologies has revolutionized our understanding of cell heterogeneity and intricate spatial structures within tissues and organs. However, the high dimensionality and noise in spatial transcriptomic data present significant challenges for downstream data analyses. Here, we develop GraphPCA, an interpretable and quasi-linear dimension reduction algorithm that leverages the strengths of graphical regularization and principal component analysis. Comprehensive evaluations on simulated and multi-resolution spatial transcriptomic datasets generated from various platforms demonstrate the capacity of GraphPCA to enhance downstream analysis tasks including spatial domain detection, denoising, and trajectory inference compared to other state-of-the-art methods.
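To make the combination of graph regularization and principal component analysis concrete, here is a minimal sketch of a generic graph-regularized PCA. It assumes the objective ||X - Z Wᵀ||² + λ·tr(Zᵀ L Z) with orthonormal loadings W, where L is the Laplacian of a spot neighborhood graph; under that assumption the solution is closed-form. The function name, the chain-shaped toy graph, and the parameter `lam` are illustrative, and the published algorithm may differ in its details.

```python
import numpy as np

def graph_regularized_pca(X, adjacency, n_components=2, lam=1.0):
    """Closed-form graph-regularized PCA (illustrative sketch).

    X: (n_spots, n_genes) column-centered expression matrix.
    adjacency: symmetric (n_spots, n_spots) neighborhood graph.
    Returns embeddings Z (n_spots, k) and orthonormal loadings W (n_genes, k).
    """
    n = X.shape[0]
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    # Smooth the data over the graph: (I + lam * L)^{-1} X
    smoothed = np.linalg.solve(np.eye(n) + lam * laplacian, X)
    gram = X.T @ smoothed                       # symmetric positive semi-definite
    eigvals, eigvecs = np.linalg.eigh(gram)     # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :n_components]      # eigenvectors of the largest eigenvalues
    Z = smoothed @ W
    return Z, W

# Hypothetical toy data: 6 spots on a chain, 4 genes.
rng = np.random.default_rng(0)
n_spots, n_genes = 6, 4
X = rng.normal(size=(n_spots, n_genes))
X -= X.mean(axis=0)
adjacency = np.zeros((n_spots, n_spots))
for i in range(n_spots - 1):
    adjacency[i, i + 1] = adjacency[i + 1, i] = 1.0

Z, W = graph_regularized_pca(X, adjacency, n_components=2, lam=1.0)
```

With `lam=0.0` the smoothing step is the identity and the sketch reduces to ordinary PCA, which is a convenient sanity check.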
IAMSAM: image-based analysis of molecular signatures using the Segment Anything Model
Spatial transcriptomics is a cutting-edge technique that combines gene expression with spatial information, allowing researchers to study molecular patterns within tissue architecture. Here, we present IAMSAM, a user-friendly web-based tool for analyzing spatial transcriptomics data focusing on morphological features. IAMSAM accurately segments tissue images using the Segment Anything Model, allowing for the semi-automatic selection of regions of interest based on morphological signatures. Furthermore, IAMSAM provides downstream analysis, such as identifying differentially expressed genes, enrichment analysis, and cell type prediction within the selected regions. With its simple interface, IAMSAM empowers researchers to explore and interpret heterogeneous tissues in a streamlined manner.
Hierarchical annotation of eQTLs by H-eQTL enables identification of genes with cell type-divergent regulation
While context-type-specific regulation of genes is largely determined by cis-regulatory regions, attempts to identify cell type-specific eQTLs are complicated by the nested nature of cell types. We present hierarchical eQTL (H-eQTL), a network-based model for hierarchical annotation of bulk-derived eQTLs to levels of a cell type tree using single-cell chromatin accessibility data and no clustering of cells into discrete cell types. Using our model, we annotate bulk-derived eQTLs from the developing brain with high specificity to levels of a cell type hierarchy, which allows sensitive detection of genes with multiple distinct non-coding elements regulating their expression in different cell types.
Cohesin distribution alone predicts chromatin organization in yeast via conserved-current loop extrusion
Inhomogeneous patterns of chromatin-chromatin contacts within 10-100-kb-sized regions of the genome are a generic feature of chromatin spatial organization. These features, termed topologically associating domains (TADs), have led to the loop extrusion factor (LEF) model. Currently, our ability to model TADs relies on the observation that in vertebrates TAD boundaries are correlated with DNA sequences that bind CTCF, which therefore is inferred to block loop extrusion. However, although TADs feature prominently in their Hi-C maps, non-vertebrate eukaryotes either do not express CTCF or show few TAD boundaries that correlate with CTCF sites. In all of these organisms, the counterparts of CTCF remain unknown, frustrating comparisons between Hi-C data and simulations.
Adenine base editors induce off-target structure variations in mouse embryos and primary human T cells
The safety of CRISPR-based gene editing methods is of the utmost priority in clinical applications. Previous studies have reported that Cas9 cleavage induced frequent aneuploidy in primary human T cells, but whether cleavage-free editing by base editors would generate off-target structural variations remains unknown. Here, we investigate the potential off-target structural variations associated with CRISPR/Cas9, ABE, and CBE editing in mouse embryos and primary human T cells by whole-genome sequencing and single-cell RNA-seq analyses.
Transcription of a centromere-enriched retroelement and local retention of its RNA are significant features of the CENP-A chromatin landscape
Centromeres depend on chromatin containing the conserved histone H3 variant CENP-A for function and inheritance, while the role of centromeric DNA repeats remains unclear. Retroelements are prevalent at centromeres across taxa and represent a potential mechanism for promoting transcription to aid in CENP-A incorporation or for generating RNA transcripts to maintain centromere integrity.
VI-VS: calibrated identification of feature dependencies in single-cell multiomics
Unveiling functional relationships between various molecular cell phenotypes from data using machine learning models is a key promise of multiomics. Existing methods either use flexible but hard-to-interpret models or simpler, misspecified models. VI-VS (Variational Inference for Variable Selection) balances flexibility and interpretability to identify relevant feature relationships in multiomic data. It uses deep generative models to identify conditionally dependent features, with false discovery rate control. VI-VS is available as an open-source Python package, providing a robust solution to identify features more likely representing genuine causal relationships.
Considerations in the search for epistasis
Epistasis refers to changes in the effect on phenotype of a unit of genetic information, such as a single nucleotide polymorphism or a gene, dependent on the context of other genetic units. Such interactions are both biologically plausible and good candidates to explain observations which are not fully explained by an additive heritability model. However, the search for epistasis has so far largely failed to recover this missing heritability. We identify key challenges and propose that future works need to leverage idealized systems, known biology and even previously identified epistatic interactions, in order to guide the search for new interactions.
The genomic portrait of the Picene culture provides new insights into the Italic Iron Age and the legacy of the Roman Empire in Central Italy
The Italic Iron Age is characterized by the presence of various ethnic groups that have been only partially examined from a genomic perspective. To explore the evolution of Iron Age Italic populations and the genetic impact of Romanization, we focus on the Picenes, one of the most fascinating pre-Roman civilizations, who flourished on the Middle Adriatic side of Central Italy between the 9th and the 3rd centuries BCE, until the Roman colonization.
scStateDynamics: deciphering the drug-responsive tumor cell state dynamics by modeling single-cell level expression changes
Understanding tumor cell heterogeneity and plasticity is crucial for overcoming drug resistance. Single-cell technologies enable analyzing cell states at a given condition, but catenating static cell snapshots to characterize dynamic drug responses remains challenging. Here, we propose scStateDynamics, an algorithm to infer tumor cell state dynamics and identify common drug effects by modeling single-cell level gene expression changes. Its reliability is validated on both simulated and lineage tracing data. Application to real tumor drug treatment datasets identifies more subtle cell subclusters with different drug responses beyond static transcriptome similarity and disentangles drug action mechanisms from the cell-level expression changes.
TDFPS-Designer: an efficient toolkit for barcode design and selection in nanopore sequencing
Oxford Nanopore Technologies (ONT) offers ultrahigh-throughput multi-sample sequencing but only provides barcode kits that enable up to 96-sample multiplexing. We present TDFPS-Designer, a new toolkit for nanopore sequencing barcode design, which creates significantly more barcodes: 137 with a length of 20 base pairs, 410 at 24 bp, and 1779 at 30 bp, far surpassing ONT's offerings. It includes GPU-based acceleration for ultra-fast demultiplexing and designs robust barcodes suitable for high-error ONT data. TDFPS-Designer outperforms current methods, improving the demultiplexing recall rate by 20% relative to Guppy, without a reduction in precision.
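The general idea of designing barcodes that stay distinguishable in high-error reads can be sketched as a greedy filter that keeps a candidate only if it is far enough from every barcode already chosen. This is a generic illustration, not TDFPS-Designer's algorithm: the use of Levenshtein distance, the 12-base toy length, the threshold `min_dist`, and the function names are all assumptions, and the toolkit's actual distance measure and selection strategy may differ.

```python
import itertools
import random

def edit_distance(s, t):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete a character of s
                            curr[j - 1] + 1,            # insert a character of t
                            prev[j - 1] + (cs != ct)))  # substitute or match
        prev = curr
    return prev[-1]

def select_barcodes(candidates, min_dist):
    """Greedily keep candidates at least min_dist edits from all chosen barcodes."""
    chosen = []
    for cand in candidates:
        if all(edit_distance(cand, kept) >= min_dist for kept in chosen):
            chosen.append(cand)
    return chosen

# Hypothetical candidate pool: 200 random 12-base sequences.
random.seed(0)
candidates = ["".join(random.choice("ACGT") for _ in range(12)) for _ in range(200)]
barcodes = select_barcodes(candidates, min_dist=6)
```

A larger `min_dist` tolerates more sequencing errors per read but leaves fewer usable barcodes, which is the trade-off that longer barcodes relax.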
SpottedPy quantifies relationships between spatial transcriptomic hotspots and uncovers environmental cues of epithelial-mesenchymal plasticity in breast cancer
Spatial transcriptomics is revolutionizing the exploration of intratissue heterogeneity in cancer, yet capturing cellular niches and their spatial relationships remains challenging. We introduce SpottedPy, a Python package designed to identify tumor hotspots and map spatial interactions within the cancer ecosystem. Using SpottedPy, we examine epithelial-mesenchymal plasticity in breast cancer and highlight stable niches associated with angiogenic and hypoxic regions, shielded by cancer-associated fibroblasts (CAFs) and macrophages. Hybrid and mesenchymal hotspot distribution follows transformation gradients reflecting progressive immunosuppression. Our method offers flexibility to explore spatial relationships at different scales, from immediate neighbors to broader tissue modules, providing new insights into tumor microenvironment dynamics.
Publisher Correction: Tagging large CNV blocks in wheat boosts digitalization of germplasm resources by ultra-low-coverage sequencing
Plant conservation in the age of genome editing: opportunities and challenges
Numerous plant taxa are threatened by habitat destruction or overexploitation. To overcome these threats, new methods are urgently needed for rescuing threatened and endangered plant species. Here, we review the genetic consequences of threats to species populations. We highlight potential advantages of genome editing for mitigating negative effects caused by new pathogens and pests or climate change where other approaches have failed. We propose solutions to protect threatened plants using genome editing technology unless absolutely necessary. We further discuss the challenges associated with genome editing in plant conservation to mitigate the decline of plant diversity.
pan-Draft: automated reconstruction of species-representative metabolic models from multiple genomes
The accurate reconstruction of genome-scale metabolic models (GEMs) for unculturable species poses challenges due to the incomplete and fragmented genetic information typical of metagenome-assembled genomes (MAGs). While existing tools leverage sequence homology from single genomes, this study introduces pan-Draft, a pan-reactome-based approach exploiting recurrent genetic evidence to determine the solid core structure of species-level GEMs. By comparing MAGs clustered at the species-level, pan-Draft addresses the issues due to the incompleteness and contamination of individual genomes, providing high-quality draft models and an accessory reactions catalog supporting the gapfilling step. This approach will improve our comprehension of metabolic functions of uncultured species.
Response to "Neglecting normalization impact in semi-synthetic RNA-seq data simulation generates artificial false positives" and "Winsorization greatly reduces false positives by popular differential expression methods when analyzing human population samples"
Two correspondences raised concerns or comments about our analyses regarding exaggerated false positives found by differential expression (DE) methods. Here, we discuss the points they raise and explain why we agree or disagree with these points. We add new analysis to confirm that the Wilcoxon rank-sum test remains the most robust method compared to the other five DE methods (DESeq2, edgeR, limma-voom, dearseq, and NOISeq) in two-condition DE analyses after considering normalization and winsorization, the data preprocessing steps discussed in the two correspondences.
Neglecting the impact of normalization in semi-synthetic RNA-seq data simulations generates artificial false positives
A recent study reported exaggerated false positives by popular differential expression methods when analyzing large population samples. We reproduce the differential expression analysis simulation results and identify a caveat in the data generation process. Data not truly generated under the null hypothesis led to incorrect comparisons of benchmark methods. We provide corrected simulation results that demonstrate the good performance of dearseq and argue against the superiority of the Wilcoxon rank-sum test as suggested in the previous study.
Winsorization greatly reduces false positives by popular differential expression methods when analyzing human population samples
A recent study found severely inflated type I error rates for DESeq2 and edgeR, two dominant tools used for differential expression analysis of RNA-seq data. Here, we show that by properly addressing the outliers in the RNA-seq data using winsorization, the type I error rates of DESeq2 and edgeR can be substantially reduced, and their power is comparable to that of the Wilcoxon rank-sum test for large datasets. Therefore, as an alternative to the Wilcoxon rank-sum test, they may still be applied for differential expression analysis of large RNA-seq datasets.
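For readers unfamiliar with winsorization, the following minimal sketch caps extreme values of a single gene at a chosen upper percentile before any downstream testing. The function name `winsorize_upper`, the 90th-percentile cutoff, and the toy counts are assumptions for illustration; the cutoff and preprocessing used in the study may differ.

```python
import numpy as np

def winsorize_upper(values, upper_pct=95.0):
    """Replace values above the given percentile with the percentile itself."""
    values = np.asarray(values, dtype=float)
    cap = np.percentile(values, upper_pct)   # linear interpolation by default
    return np.minimum(values, cap)

# Hypothetical counts for one gene across 10 samples, with a single outlier.
gene_counts = np.array([12, 15, 9, 14, 11, 13, 10, 16, 12, 950])
capped = winsorize_upper(gene_counts, upper_pct=90.0)
```

Only the outlying sample is changed; the remaining counts pass through untouched, so the gene's typical expression level is preserved while the influence of the extreme value is limited.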
Benchmarking and building DNA binding affinity models using allele-specific and allele-agnostic transcription factor binding data
Transcription factors (TFs) bind to DNA in a highly sequence-specific manner. This specificity manifests itself in vivo as differences in TF occupancy between the two alleles at heterozygous loci. Genome-scale assays such as ChIP-seq currently are limited in their power to detect allele-specific binding (ASB) both in terms of read coverage and representation of individual variants in the cell lines used. This makes prediction of allelic differences in TF binding from sequence alone desirable, provided that the reliability of such predictions can be quantitatively assessed.