Identifying T-cell clubs by embracing the local harmony between TCR and gene expressions
T cell receptors (TCRs) and gene expression provide two complementary and essential views of T cells, yet their diversity presents challenges for integrative analysis. We introduce TCRclub, a novel method that integrates single-cell RNA sequencing data and single-cell TCR sequencing data using local harmony to identify functionally similar T cell groups, termed 'clubs'. We applied TCRclub to 298,106 T cells across seven datasets encompassing various diseases. First, TCRclub outperforms state-of-the-art methods in clustering T cells on a dataset with over 400 verified peptide-major histocompatibility complex categories. Second, TCRclub reveals a transition from activated to exhausted T cells in cholangiocarcinoma patients. Third, by analyzing pre-treatment and post-treatment samples, TCRclub discovers pathways that could intervene in the response to anti-PD-1 therapy for patients with basal cell carcinoma. Furthermore, TCRclub unveils different T-cell responses and gene patterns at different severity levels in patients with COVID-19. Hence, TCRclub aids in developing more effective immunotherapeutic strategies for cancer and infectious diseases.
Subcellular mRNA kinetic modeling reveals nuclear retention as rate-limiting
Eukaryotic mRNAs are transcribed, processed, translated, and degraded in different subcellular compartments. Here, we measured mRNA flow rates between subcellular compartments in mouse embryonic stem cells. By combining metabolic RNA labeling, biochemical fractionation, mRNA sequencing, and mathematical modeling, we determined the half-lives of nuclear pre-, nuclear mature, cytosolic, and membrane-associated mRNAs from over 9000 genes. In addition, we estimated transcript elongation rates. Many matured mRNAs have long nuclear half-lives, indicating nuclear retention as the rate-limiting step in the flow of mRNAs. In contrast, mRNA transcripts coding for transcription factors show fast kinetic rates, and in particular short nuclear half-lives. Differentially localized mRNAs have distinct rate constant combinations, implying modular regulation. Membrane stability is high for membrane-localized mRNA and cytosolic stability is high for cytosol-localized mRNA. mRNAs encoding target signals for membranes have low cytosolic and high membrane half-lives with minor differences between signals. Transcripts of nuclear-encoded mitochondrial proteins have long nuclear retention and cytoplasmic kinetics that do not reflect co-translational targeting. Our data and analyses provide a useful resource to study spatiotemporal gene expression regulation.
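The abstract above reports half-lives for each mRNA pool. As a minimal sketch of how a half-life relates to a rate constant, the Python below assumes simple first-order decay in a single compartment; the function names are hypothetical, and this is not the multi-compartment model used by the authors.

```python
import math


def half_life(rate_constant: float) -> float:
    """Half-life for first-order decay: t_half = ln(2) / k (assumed model)."""
    if rate_constant <= 0:
        raise ValueError("rate constant must be positive")
    return math.log(2) / rate_constant


def fraction_remaining(rate_constant: float, t: float) -> float:
    """Fraction of the initial pool left after time t under first-order decay."""
    return math.exp(-rate_constant * t)
```

Under this assumption, doubling the rate constant halves the half-life, which is why a slow nuclear export rate appears as a long nuclear half-life.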
Author Correction: Predictive evolution of metabolic phenotypes using model-designed environments
Enhancers and genome conformation provide complex transcriptional control of a herpesviral gene
Complex transcriptional control is a conserved feature of both eukaryotes and the viruses that infect them. Despite viral genomes being smaller and more gene dense than their hosts, we generally lack a sense of scope for the features governing the transcriptional output of individual viral genes. Even having a seemingly simple expression pattern does not imply that a gene's underlying regulation is straightforward. Here, we illustrate this by combining high-density functional genomics, expression profiling, and viral-specific chromosome conformation capture to define with unprecedented detail the transcriptional regulation of a single gene from Kaposi's sarcoma-associated herpesvirus (KSHV). We used as our model KSHV ORF68 - which has simple, early expression kinetics and is essential for viral genome packaging. We first identified seven cis-regulatory regions involved in ORF68 expression by densely tiling the ~154 kb KSHV genome with dCas9 fused to a transcriptional repressor domain (CRISPRi). A parallel Cas9 nuclease screen indicated that three of these regions act as promoters of genes that regulate ORF68. RNA expression profiling demonstrated that three more of these regions act by either repressing or enhancing other distal viral genes involved in ORF68 transcriptional regulation. Finally, we tracked how the 3D structure of the viral genome changes during its lifecycle, revealing that these enhancing regulatory elements are physically closer to their targets when active, and that disrupting some elements caused large-scale changes to the 3D genome. These data enable us to construct a complete model revealing that the mechanistic diversity of this essential regulatory circuit matches that of human genes.
Tissue-aware interpretation of genetic variants advances the etiology of rare diseases
Pathogenic variants underlying Mendelian diseases often disrupt the normal physiology of a few tissues and organs. However, variant effect prediction tools that aim to identify pathogenic variants are typically oblivious to tissue contexts. Here we report a machine-learning framework, denoted "Tissue Risk Assessment of Causality by Expression for variants" (TRACEvar, https://netbio.bgu.ac.il/TRACEvar/ ), that offers two advancements. First, TRACEvar predicts pathogenic variants that disrupt the normal physiology of specific tissues. This was achieved by creating 14 tissue-specific models that were trained on over 14,000 variants and combined 84 attributes of genetic variants with 495 attributes derived from tissue omics. TRACEvar outperformed 10 well-established and tissue-oblivious variant effect prediction tools. Second, the resulting models are interpretable, thereby illuminating variants' mode of action. Application of TRACEvar to variants of 52 rare-disease patients highlighted pathogenicity mechanisms and relevant disease processes. Lastly, the interpretation of all tissue models revealed that top-ranking determinants of pathogenicity included attributes of disease-affected tissues, particularly cellular process activities. Collectively, these results show that tissue contexts and interpretable machine-learning models can greatly enhance the etiology of rare diseases.
Prediction of the 3D cancer genome from whole-genome sequencing using InfoHiC
Predicting the 3D genome in cancer is crucial for uncovering the impact of structural variations (SVs) on tumorigenesis, especially when they are present in noncoding regions. We present InfoHiC, a systemic framework for predicting the 3D cancer genome directly from whole-genome sequencing (WGS). InfoHiC utilizes contig-specific copy number encoding on the SV contig assembly, and performs a contig-to-total Hi-C conversion for the cancer Hi-C prediction from multiple SV contigs. Using breast cancer cell line data, we showed that InfoHiC can predict 3D genome folding from all types of SVs. We applied it to WGS data of patients with breast cancer and pediatric patients with medulloblastoma, and identified neo topologically associating domains. For breast cancer, we discovered super-enhancer hijacking events associated with oncogenic overexpression and poor survival outcomes. For medulloblastoma, we found SVs in noncoding regions that caused super-enhancer hijacking events of medulloblastoma driver genes (GFI1, GFI1B, and PRDM6). In addition, we provide trained models for cancer Hi-C prediction from WGS at https://github.com/dmcb-gist/InfoHiC , uncovering the impacts of SVs in cancer patients and revealing novel therapeutic targets.
Proteome-wide copy-number estimation from transcriptomics
Protein copy numbers constrain systems-level properties of regulatory networks, but proportional proteomic data remain scarce compared to RNA-seq. We related mRNA to protein statistically using best-available data from quantitative proteomics and transcriptomics for 4366 genes in 369 cell lines. The approach starts with a protein's median copy number and hierarchically appends mRNA-protein and mRNA-mRNA dependencies to define an optimal gene-specific model linking mRNAs to protein. For dozens of cell lines and primary samples, these protein inferences from mRNA outmatch stringent null models, a count-based protein-abundance repository, empirical mRNA-to-protein ratios, and a proteogenomic DREAM challenge winner. The optimal mRNA-to-protein relationships capture biological processes along with hundreds of known protein-protein complexes, suggesting mechanistic relationships. We use the method to identify a viral-receptor abundance threshold for coxsackievirus B3 susceptibility from 1489 systems-biology infection models parameterized by protein inference. When applied to 796 RNA-seq profiles of breast cancer, inferred copy-number estimates collectively re-classify 26-29% of luminal tumors. By adopting a gene-centered perspective of mRNA-protein covariation across different biological contexts, we achieve accuracies comparable to the technical reproducibility of contemporary proteomics.
Global atlas of predicted functional domains in Legionella pneumophila Dot/Icm translocated effectors
Legionella pneumophila utilizes the Dot/Icm type IVB secretion system to deliver hundreds of effector proteins inside eukaryotic cells to ensure intracellular replication. Our understanding of the molecular functions of the largest pathogenic arsenal known to the bacterial world remains incomplete. By leveraging advancements in 3D protein structure prediction, we provide a comprehensive structural analysis of 368 L. pneumophila effectors, representing a global atlas of predicted functional domains summarized in a database ( https://pathogens3d.org/legionella-pneumophila ). Our analysis identified 157 types of diverse functional domains in 287 effectors, including 159 effectors with no prior functional annotations. Furthermore, we identified 35 cryptic domains in 30 effector models that have no similarity with experimentally structurally characterized proteins, thus, hinting at novel functionalities. Using this analysis, we demonstrate the activity of thirteen functional domains, including three cryptic domains, predicted in L. pneumophila effectors to cause growth defects in the Saccharomyces cerevisiae model system. This illustrates an emerging strategy of exploring synergies between predictions and targeted experimental approaches in elucidating novel effector activities involved in infection.
Correction of a widespread bias in pooled chemical genomics screens improves their interpretability
Chemical genomics is a powerful and increasingly accessible technique to probe gene function, gene-gene interactions, and antibiotic synergies and antagonisms. Indeed, multiple large-scale pooled datasets in diverse organisms have been published. Here, we identify an artifact arising from uncorrected differences in the number of cell doublings between experiments within such datasets. We demonstrate that this artifact is widespread, show how it causes spurious gene-gene and drug-drug correlations, and present a simple but effective post hoc method for removing its effects. Using several published datasets, we demonstrate that this correction removes spurious correlations between genes and conditions, improving data interpretability and revealing new biological insights. Finally, we determine experimental factors that predispose a dataset for this artifact and suggest a set of experimental and computational guidelines for performing pooled chemical genomics experiments that will maximize the potential of this powerful technique.
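To picture why unequal numbers of doublings can distort pooled screens: under exponential growth, a mutant's log2 fold change relative to the pool scales roughly linearly with the number of generations. The sketch below rescales scores to a per-doubling basis. It is an illustrative assumption with a hypothetical function name, not the correction method presented by the authors.

```python
def per_doubling_scores(log2_fold_changes, doublings):
    """Rescale log2 fold changes by the number of cell doublings.

    Assumes log2 fold change grows linearly with doublings, so dividing
    makes experiments with different growth durations comparable.
    """
    if doublings <= 0:
        raise ValueError("number of doublings must be positive")
    return [lfc / doublings for lfc in log2_fold_changes]
```

With this rescaling, a mutant showing a log2 fold change of -3 after 6 doublings and -5 after 10 doublings receives the same per-doubling score in both experiments.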
Author Correction: From coarse to fine: the absolute Escherichia coli proteome under diverse growth conditions
High-throughput protein characterization by complementation using DNA barcoded fragment libraries
Our ability to predict, control, or design biological function is fundamentally limited by poorly annotated gene function. This can be particularly challenging in non-model systems. Accordingly, there is motivation for new high-throughput methods for accurate functional annotation. Here, we used complementation of auxotrophs and DNA barcode sequencing (Coaux-Seq) to enable high-throughput characterization of protein function. Fragment libraries from eleven genetically diverse bacteria were tested in twenty different auxotrophic strains of Escherichia coli to identify genes that complement missing biochemical activity. We recovered 41% of expected hits, with effectiveness varying by source genome, and observed success even with distant E. coli relatives like Bacillus subtilis and Bacteroides thetaiotaomicron. Coaux-Seq provided the first experimental validation for 53 proteins, of which 11 are less than 40% identical to an experimentally characterized protein. Among the unexpected functions identified were a sulfate uptake transporter, an O-succinylhomoserine sulfhydrylase for methionine synthesis, and an aminotransferase. We also identified instances of cross-feeding wherein protein overexpression and nearby non-auxotrophic strains enabled growth. Altogether, Coaux-Seq's utility is demonstrated, with future applications in ecology, health, and engineering.
XCMS-METLIN: data-driven metabolite, lipid, and chemical analysis
In this Correspondence, G. Siuzdak and colleagues present XCMS-METLIN, an extensive resource for metabolomics, lipidomics, and chemical analysis.
Highly parallelized laboratory evolution of wine yeasts for enhanced metabolic phenotypes
Adaptive Laboratory Evolution (ALE) of microorganisms can improve the efficiency of sustainable industrial processes important to the global economy. However, stochasticity and genetic background effects often lead to suboptimal outcomes during laboratory evolution. Here we report an ALE platform to circumvent these shortcomings through parallelized clonal evolution at an unprecedented scale. Using this platform, we evolved 10 yeast populations in parallel from many strains for eight desired wine fermentation-related traits. Expansions of both ALE replicates and lineage numbers broadened the evolutionary search spectrum leading to improved wine yeasts unencumbered by unwanted side effects. At the genomic level, evolutionary gains in metabolic characteristics often coincided with distinct chromosome amplifications and the emergence of side-effect syndromes that were characteristic of each selection niche. Several high-performing ALE strains exhibited desired wine fermentation kinetics when tested in larger liquid cultures, supporting their suitability for application. More broadly, our high-throughput ALE platform opens opportunities for rapid optimization of microbes which otherwise could take many years to accomplish.
Bacterial live therapeutics for human diseases
The genomic revolution has fueled rapid progress in synthetic and systems biology, opening up new possibilities for using live biotherapeutic products (LBP) to treat, attenuate or prevent human diseases. Among LBP, bacteria-based therapies are particularly promising due to their ability to colonize diverse human tissues, modulate the immune system and secrete or deliver complex biological products. These bacterial LBP include engineered pathogenic species designed to target specific diseases, and microbiota species that promote microbial balance and immune system homeostasis, either through local administration or the gut-body axes. This review focuses on recent advancements in preclinical and clinical trials of bacteria-based LBP, highlighting both on-site and long-reaching strategies.
Yeast9: a consensus genome-scale metabolic model for S. cerevisiae curated by the community
Genome-scale metabolic models (GEMs) can facilitate metabolism-focused multi-omics integrative analysis. Since the publication of Yeast8 in 2019, the yeast-GEM of Saccharomyces cerevisiae has been continuously updated by the community. This has increased the quality and scope of the model, culminating now in Yeast9. To evaluate its predictive performance, we generated 163 condition-specific GEMs constrained by single-cell transcriptomics from osmotic pressure or reference conditions. Comparative flux analysis showed that yeast adapting to high osmotic pressure benefits from upregulating fluxes through central carbon metabolism. Furthermore, combining Yeast9 with proteomics revealed metabolic rewiring underlying its preference for nitrogen sources. Lastly, we created strain-specific GEMs (ssGEMs) constrained by transcriptomics for 1229 mutant strains. These large-scale ssGEMs predicted the strains' growth rates well, and their fluxomics outperformed transcriptomics in predicting functional categories for all studied genes in machine learning models. Based on these findings, we anticipate that Yeast9 will continue to empower systems biology studies of yeast metabolism.
Deep quantification of substrate turnover defines protease subsite cooperativity
Substrate specificity determines protease functions in physiology and in clinical and biotechnological applications, yet quantitative cleavage information is often unavailable, biased, or limited to a small number of events. Here, we develop qPISA (quantitative Protease specificity Inference from Substrate Analysis) to study Dipeptidyl Peptidase Four (DPP4), a key regulator of blood glucose levels. We use mass spectrometry to quantify >40,000 peptides from a complex, commercially available peptide mixture. By analyzing changes in substrate levels quantitatively instead of focusing on qualitative product identification through a binary classifier, we can reveal cooperative interactions within DPP4's active pocket and derive a sequence motif that predicts activity quantitatively. qPISA distinguishes DPP4 from the related C. elegans DPF-3 (a DPP8/9-orthologue), and we relate the differences to the structural features of the two enzymes. We demonstrate that qPISA can direct protein engineering efforts like the stabilization of GLP-1, a key DPP4 substrate used in the treatment of diabetes and obesity. Thus, qPISA offers a versatile approach for profiling protease and especially exopeptidase specificity, facilitating insight into enzyme mechanisms and biotechnological and clinical applications.
Dengue virus preferentially uses human and mosquito non-optimal codons
Codon optimality refers to the effect that codon composition has on messenger RNA (mRNA) stability and translation level and implies that synonymous codons are not silent from a regulatory point of view. Here, we investigated the adaptation of virus genomes to the host optimality code using mosquito-borne dengue virus (DENV) as a model. We demonstrated that codon optimality exists in mosquito cells and showed that DENV preferentially uses nonoptimal (destabilizing) codons and avoids codons that are defined as optimal (stabilizing) in either human or mosquito cells. Human genes enriched in the codons preferentially and frequently used by DENV are upregulated during infection, and so is the tRNA decoding the nonoptimal and DENV preferentially used codon for arginine. We found that adaptation during single-host passaging in human or mosquito cells results in the selection of synonymous mutations towards DENV's preferred nonoptimal codons that increase virus fitness. Finally, our analyses revealed that hundreds of viruses preferentially use nonoptimal codons, with those infecting a single host displaying an even stronger bias, suggesting that host-pathogen interaction shapes virus-synonymous codon choice.
Somatic CpG hypermutation is associated with mismatch repair deficiency in cancer
Somatic hypermutation in cancer has gained momentum with the increased use of tumour mutation burden as a biomarker for immune checkpoint inhibitors. Spontaneous deamination of 5-methylcytosine to thymine at CpG dinucleotides is one of the most ubiquitous endogenous mutational processes in normal and cancer cells. Here, we performed a systematic investigation of somatic CpG hypermutation at a pan-cancer level. We studied 30,191 cancer patients and 103 cancer types and developed an algorithm to identify somatic CpG hypermutation. Across cancer types, we observed the highest prevalence in paediatric leukaemia (3.5%), paediatric high-grade glioma (1.7%), and colorectal cancer (1%). We discovered germline variants and somatic mutations in the mismatch repair complex MutSα (MSH2-MSH6) as genetic drivers of somatic CpG hypermutation in cancer, which frequently converged on CpG sites and TP53 driver mutations. We further observe an association between somatic CpG hypermutation and response to immune checkpoint inhibitors. Overall, our study identified novel cancer types that display somatic CpG hypermutation, strong association with MutSα-deficiency, and potential utility in cancer immunotherapy.
Rescuing error control in crosslinking mass spectrometry
Crosslinking mass spectrometry is a powerful tool to study protein-protein interactions under native or near-native conditions in complex mixtures. Through novel search controls, we show how biassing results towards likely correct proteins can subtly undermine error estimation of crosslinks, with significant consequences. Without adjustments to address this issue, we have misidentified an average of 260 interspecies protein-protein interactions across 16 analyses in which we synthetically mixed data of different species, misleadingly suggesting profound biological connections that do not exist. We also demonstrate how data analysis procedures can be tested and refined to restore the integrity of the decoy-false positive relationship, a crucial element for reliably identifying protein-protein interactions.
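For readers unfamiliar with the decoy-false positive relationship mentioned above, the sketch below shows a commonly used target-decoy estimator for crosslink matches, where each match pairs two peptides that can each be target (T) or decoy (D). The function name is hypothetical, and the estimator is given as general background, not as the specific search controls introduced by the authors.

```python
def crosslink_fdr(n_tt: int, n_td: int, n_dd: int) -> float:
    """Estimate the false discovery rate among target-target matches.

    Uses the common estimator FDR = (TD - DD) / TT, where TT, TD and DD
    count target-target, target-decoy and decoy-decoy matches that pass
    the score threshold. The result is clipped at zero.
    """
    if n_tt <= 0:
        raise ValueError("need at least one target-target match")
    return max(0.0, (n_td - n_dd) / n_tt)
```

The estimate is only meaningful if decoys are as likely to be matched as false targets; any filtering step that favours likely correct proteins without treating decoys equally breaks that assumption.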
Time-resolved interactome profiling deconvolutes secretory protein quality control dynamics
Many cellular processes are governed by protein-protein interactions that require tight spatial and temporal regulation. Accordingly, it is necessary to understand the dynamics of these interactions to fully comprehend and elucidate cellular processes and pathological disease states. To map de novo protein-protein interactions with time resolution at an organelle-wide scale, we developed a quantitative mass spectrometry method, time-resolved interactome profiling (TRIP). We apply TRIP to elucidate aberrant protein interaction dynamics that lead to the protein misfolding disease congenital hypothyroidism. We deconvolute altered temporal interactions of the thyroid hormone precursor thyroglobulin with pathways implicated in hypothyroidism pathophysiology, such as Hsp70-/90-assisted folding, disulfide/redox processing, and N-glycosylation. Functional siRNA screening identified VCP and TEX264 as key protein degradation components whose inhibition selectively rescues mutant prohormone secretion. Ultimately, our results provide novel insight into the temporal coordination of protein homeostasis, and our TRIP method should find broad applications in investigating protein-folding diseases and cellular processes.
Identification of novel toxins associated with the extracellular contractile injection system using machine learning
Secretion systems play a crucial role in microbe-microbe or host-microbe interactions. Among these systems, the extracellular contractile injection system (eCIS) is a unique bacterial and archaeal extracellular secretion system that injects protein toxins into target organisms. However, the specific proteins that eCISs inject into target cells and their functions remain largely unknown. Here, we developed a machine learning classifier that combines genetic and biochemical features to identify eCIS-associated toxins (EATs). We also developed a score for the eCIS N-terminal signal peptide to predict EAT loading. Using the classifier, we classified 2,194 genes from 950 genomes as putative EATs. We validated four new EATs, EAT14-17, showing toxicity in bacterial and eukaryotic cells, and identified residues of their respective active sites that are critical for toxicity. Finally, we show that EAT14 inhibits mitogenic signaling in human cells. Our study provides insights into the diversity and functions of EATs and demonstrates the capability of machine learning to identify novel toxins. The toxins can be employed in various applications dependently or independently of eCIS.