Evolution of Plant Genome Size and Composition
The rapid development of sequencing technology has led to an explosion of plant genome data, opening up more opportunities for comparative evolutionary analysis of plant genomes. In this review, we take changes in plant genome size and composition as a starting point and describe the effects of polyploidy, whole-genome duplication, and transposable element changes on plant genome architecture and evolution. In addition, to address the lack of relevant information in some areas, we collected and analyzed data from 234 representative plant genomes as a supplement. We aim to provide a global, up-to-date summary of plant genome architecture and evolution.
Enzymes Repertoires and Genomic Insights into Lycium Barbarum Pectin Polysaccharides Biosynthesis
Lycium barbarum, a member of the Solanaceae family, represents an important eudicot lineage that is used as both food and medicine. Lycium barbarum pectin polysaccharides (LBPPs) are key bioactive ingredients of Lycium barbarum, and are among the few polysaccharides with both biocompatibility and biomedical activity. While previous studies have primarily focused on the functional properties of LBPPs, the mechanisms of their biosynthesis and transport by key enzymes remain poorly understood. Here, we report the completion of a 2.18-gigabase reference genome of Lycium barbarum, reconstruct for the first time the entire pectin polysaccharide biosynthesis and sugar transport pathways, and characterize the important genes responsible for backbone extension, sidechain synthesis, and modification of pectin polysaccharides. Additionally, we characterize long non-coding RNAs (lncRNAs) associated with polysaccharide metabolism and identify a specific rhamnogalacturonan I (RG-I) rhamnosyltransferase, RRT3020, which enhances RG-I biosynthesis in LBPPs. These newly identified enzymes and pivotal genes endow L. barbarum with specific pectin biosynthesis capabilities, distinguishing it from other Solanaceae species. Our findings provide a foundation for evolutionary studies and molecular breeding to enhance the diverse applications of L. barbarum.
SoyOD: An Integrated Soybean Multi-omics Database for Mining Genes and Biological Research
Soybean is a globally important crop for food, feed, oil, and nitrogen fixation. A variety of multi-omics studies have been carried out, generating datasets ranging from genotype to phenotype. To efficiently utilize these data for basic and applied research, a soybean multi-omics database with extensive data coverage and comprehensive data analysis tools was established. The Soybean Omics Database (SoyOD) integrates important new datasets with existing public datasets to form the most comprehensive collection of soybean multi-omics information. Compared to existing soybean databases, SoyOD incorporates an extensive collection of novel data derived from the deep sequencing of 984 germplasms, 162 novel transcriptome datasets from seeds at different developmental stages, 53 phenotypic datasets, and more than 2500 phenotypic images. In addition, SoyOD integrates existing data resources, including 59 assembled genomes, genetic variation data from 3904 soybean accessions, 225 sets of phenotypic data, and 1097 transcriptomic sequences covering 507 different tissues and treatment conditions. Moreover, SoyOD can be used to mine candidate genes for important agronomic traits, as shown in a case study on plant height. Additionally, powerful and easy-to-use analytical toolkits enable users to access the available multi-omics datasets and to rapidly search genotypic and phenotypic data for a particular germplasm. The novelty, comprehensiveness, and user-friendly features of SoyOD make it a valuable resource for soybean molecular breeding and biological research. SoyOD is publicly accessible at https://bis.zju.edu.cn/soyod.
Pangenome Reveals Gene Content Variations and Structural Variants Contributing to Pig Characteristics
Pigs are among the most essential sources of high-quality protein in human diets. Structural variants (SVs) are a major source of genetic variants associated with diverse traits and evolutionary events. However, the current linear reference genome of pigs limits the presentation of position information for SVs. In this study, we generated a pangenome of pigs and a genome variation map of 599 deep-sequenced genomes across Eurasia. Moreover, a section-wide gene repertoire was constructed, which indicated that core genes were more evolutionarily conserved than variable genes. Subsequently, we identified 546,137 SVs, their enrichment regions, and relationships with genomic features and found significant divergence across Eurasian pigs. More importantly, the pangenome-detected SVs could complement heritability estimates and genome-wide association studies based only on single nucleotide polymorphisms. Among the SVs shaped by selection, we identified an insertion in the promoter region of the TBX19 gene, which may be related to the development, growth, and timidity traits of Asian pigs and may affect the gene expression. Our constructed pig pangenome and the identified SVs provide rich resources for future functional genomic research on pigs.
VISTA: A Tool for Fast Taxonomic Assignment of Viral Genome Sequences
The rapid expansion of the number of viral genome sequences in public databases necessitates a scalable, universal, and automated preliminary taxonomic framework for comprehensive virus studies. Here, we introduce VISTA (Virus Sequence-based Taxonomy Assignment), a computational tool that employs a novel pairwise sequence comparison system and an automatic demarcation threshold identification framework for virus taxonomy. Leveraging physio-chemical property sequences, k-mer profiles, and machine learning techniques, VISTA constructs a robust distance-based framework for taxonomic assignment. Functionally similar to PASC (Pairwise Sequence Comparison), a widely used virus assignment tool based on pairwise sequence comparison, VISTA demonstrates superior performance by providing significantly improved separation for taxonomic groups, more objective taxonomic demarcation thresholds, greatly enhanced speed, and a wider application scope. We successfully applied VISTA to 38 virus families, as well as to the class Caudoviricetes. This demonstrates VISTA's scalability, robustness, and ability to automatically and accurately assign taxonomy to both prokaryotic and eukaryotic viruses. Furthermore, the application of VISTA to 679 unclassified prokaryotic virus genomes recovered from metagenomic data identified 46 novel virus families. VISTA is available as both a command line tool and a user-friendly web portal at https://ngdc.cncb.ac.cn/vista.
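The idea of comparing genomes through k-mer profiles and a distance threshold can be illustrated with a minimal sketch. This is not VISTA's implementation: the cosine distance, the value of k, the 0.2 threshold, and the function names below are all assumptions chosen for illustration, and the abstract does not specify how VISTA combines k-mer profiles with physio-chemical property sequences or machine learning.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq: str, k: int = 3) -> Counter:
    """Count overlapping k-mers, skipping any k-mer with a non-ACGT character."""
    seq = seq.upper()
    return Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    )


def cosine_distance(p: Counter, q: Counter) -> float:
    """Return 1 - cosine similarity between two k-mer count profiles."""
    dot = sum(p[kmer] * q[kmer] for kmer in p.keys() & q.keys())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0  # an empty profile is treated as maximally distant
    return 1.0 - dot / (norm_p * norm_q)


def same_group(seq_a: str, seq_b: str, threshold: float = 0.2, k: int = 3) -> bool:
    """Hypothetical demarcation rule: same group if the distance is within the threshold."""
    return cosine_distance(kmer_profile(seq_a, k), kmer_profile(seq_b, k)) <= threshold
```

In practice the threshold would be learned per taxonomic rank rather than fixed by hand, which is the part the abstract attributes to VISTA's automatic demarcation framework.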
ProtPipe: A Multifunctional Data Analysis Pipeline for Proteomics and Peptidomics
Mass spectrometry (MS) is a technique widely employed for the identification and characterization of proteins, with applications in personalized medicine, systems biology, and biomedicine. The application of MS-based proteomics advances our understanding of protein function, cellular signaling, and complex biological systems. MS data analysis is a critical process that includes identifying and quantifying proteins and peptides and then exploring their biological functions in downstream analysis. To address the complexities associated with MS data analysis, we developed ProtPipe to streamline and automate the processing and analysis of high-throughput proteomics and peptidomics datasets, with DIA-NN preinstalled. The pipeline facilitates data quality control, sample filtering, and normalization, ensuring robust and reliable downstream analyses. ProtPipe provides downstream analyses, including protein and peptide differential abundance identification, pathway enrichment analysis, protein-protein interaction analysis, and major histocompatibility complex (MHC)-peptide binding affinity analysis. ProtPipe generates annotated tables and visualizations by performing statistical postprocessing and calculating fold changes between predefined pairwise conditions in an experimental design. It is an open-source, well-documented tool available online at https://github.com/NIH-CARD/ProtPipe, with a user-friendly web interface.
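As a rough illustration of what calculating a fold change between two predefined conditions involves, the sketch below computes a log2 fold change from replicate abundances. It is not ProtPipe's code: the function name, the pseudocount, and the example values are assumptions made for this illustration.

```python
from math import log2
from statistics import mean


def log2_fold_change(treated, control, pseudocount=1.0):
    """log2 ratio of mean abundance, treated over control.

    The pseudocount (an assumption of this sketch) avoids division by zero
    when a protein is undetected in one condition.
    """
    return log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))


# Hypothetical replicate abundances for two proteins in one pairwise comparison.
abundances = {
    "PROT_A": {"control": [1.0, 1.0, 1.0], "treated": [7.0, 7.0, 7.0]},
    "PROT_B": {"control": [3.0, 3.0, 3.0], "treated": [3.0, 3.0, 3.0]},
}

fold_changes = {
    protein: log2_fold_change(values["treated"], values["control"])
    for protein, values in abundances.items()
}
```

A positive value indicates higher abundance in the treated condition; a real differential abundance analysis would pair this with a statistical test and multiple-testing correction.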
iMFP-LG: Identification of Novel Multi-Functional Peptides by Using Protein Language Models and Graph-Based Deep Learning
Functional peptides are short amino acid fragments that have a wide range of beneficial functions for living organisms. The majority of previous research focused on mono-functional peptides, but a growing number of multi-functional peptides have been discovered. Although there have been enormous experimental efforts to assay multi-functional peptides, only a small fraction of the millions of known peptides have been explored. Effective and precise techniques for identifying multi-functional peptides can facilitate their discovery and mechanistic understanding. In this article, we present iMFP-LG, a method for identifying multi-functional peptides based on protein language models (pLMs) and graph attention networks (GATs). Comparison results showed that iMFP-LG outperforms state-of-the-art methods on both multi-functional bioactive peptide and multi-functional therapeutic peptide datasets. The interpretability of iMFP-LG was also illustrated by visualizing attention patterns in pLMs and GATs. Given the strong performance of iMFP-LG in identifying multi-functional peptides, we employed it to screen novel candidate peptides with both anticancer peptide (ACP) and antimicrobial peptide (AMP) functions from millions of known peptides in UniRef90. As a result, 8 candidate peptides were identified, and 1 candidate that exhibits both antibacterial and anticancer effects was confirmed through molecular structure alignment and biological experiments. We anticipate that iMFP-LG can assist in the discovery of multi-functional peptides and contribute to the advancement of peptide drug design.
Comprehensive Characterization of the Integrin Family Across 32 Cancer Types
Integrin genes are widely involved in tumorigenesis. Yet, a comprehensive characterization of integrin family members and their interactome at the pan-cancer level is lacking. Here, we systematically analyzed the integrin family in approximately 10,000 tumors across 32 cancer types. Globally, integrins represent a frequently altered and misexpressed pathway, with alteration and dysregulation overall being protumorigenic. Expression dysregulation of the integrin family, more so than its mutational landscape, identifies a subgroup of aggressive tumors with a high level of proliferation and stemness. The results reveal that several molecular mechanisms collectively regulate integrin expression in a context-dependent manner. For potential clinical usage, we constructed a weighted scoring system, integrinScore, to measure integrin signaling patterns in individual tumors. Remarkably, integrinScore was consistently correlated with predefined molecular subtypes in multiple cancers, with integrinScore-high tumors being more aggressive. Importantly, integrinScore was cancer-dependent and closely associated with proliferation, stemness, tumor microenvironment, metastasis, and immune signatures. IntegrinScore also predicted patients' response to immunotherapy. By mining drug databases, we uncovered an array of compounds that may modulate integrin signaling. Finally, we built a user-friendly database, Pan-cancer Integrin Explorer (PIExplorer; http://computationalbiology.cn/PIExplorer), to help researchers explore integrin-related knowledge. Collectively, we provide a comprehensive characterization of integrins across cancers and offer gene-specific and cancer-specific rationales for developing integrin-targeted therapy.
Multi-omics Mediated Wide Association Studies: Novel Approaches for Understanding Diseases
The rapid development of multi-omics (transcriptome, proteome, cistrome, imaging, and regulome) mediated wide association study methods has opened new avenues for biologists to understand the susceptibility genes underlying complex diseases. Thorough comparisons of these methods are essential for selecting the most appropriate tool for a given research objective. This review provides a detailed categorization and summary of the statistical models, use cases, and advantages of recent multi-omics mediated wide association studies. In addition, to illustrate gene-disease association studies based on transcriptome-wide association studies (TWAS), we collected 478 disease entries across 22 categories from 235 manually reviewed publications. Our analysis reveals that mental disorders are the diseases most frequently studied by TWAS, indicating its potential to deepen our understanding of the genetic architecture of complex diseases. In summary, this review underscores the importance of multi-omics mediated wide association studies in elucidating complex diseases and highlights the significance of selecting the appropriate method for each study.
TCRosetta: An Integrated Analysis and Annotation Platform for T-cell Receptor Sequences
T cells and T-cell receptors (TCRs) are essential components of the adaptive immune system. Characterization of the TCR repertoire offers a promising and highly informative source for understanding the functions of T cells in the immune response and immunotherapy. Although TCR repertoire studies have attracted much attention, there are few online servers available for TCR repertoire analysis, especially for TCR sequence annotation or advanced analyses. Therefore, we developed TCRosetta, a comprehensive online server that integrates analytical methods for TCR repertoire analysis and visualization. TCRosetta combines general feature analysis, large-scale sequence clustering, network construction, peptide-TCR binding prediction, generation probability calculation, and k-mer motif analysis for TCR sequences, making TCR data analysis as simple as possible. The TCRosetta server accepts multiple input data formats and can analyze ~20,000 TCR sequences in less than 3 min. TCRosetta is the most comprehensive web server available for TCR repertoire analysis and is freely available at https://guolab.wchscu.cn/TCRosetta/.
Bioinformatic Resources for Exploring Human-virus Protein-protein Interactions Based on Binding Modes
Historically, there have been many outbreaks of viral diseases that have continued to claim millions of lives. Research on human-virus protein-protein interactions (PPIs) is vital to understanding the principles of human-virus relationships, providing an essential foundation for developing virus control strategies to combat diseases. The rapidly accumulating data on human-virus PPIs offer unprecedented opportunities for bioinformatics research on human-virus PPIs. However, detailed analyses and summaries that help researchers use these resources systematically and efficiently are lacking. Here, we comprehensively review the bioinformatic tools used in human-virus PPI research, and we discuss and compare the function, performance, and limitations of these web resources. This study aims to provide researchers with a bioinformatic toolbox that facilitates the exploration of human-virus PPIs based on binding modes.
DeOri 10.0: An Updated Database of Experimentally Identified Eukaryotic Replication Origins
DNA replication is a complex and crucial biological process in eukaryotes. To facilitate the study of eukaryotic replication events, we present a database of eukaryotic DNA replication origins (DeOri), which collects scattered data and integrates extensive sequencing data on eukaryotic DNA replication origins. With continuous updates of DeOri, the number of datasets in the new release increased from 10 to 151 and the number of sequences increased from 16,145 to 9,742,396. Besides nucleotide sequences and BED files, corresponding annotation files, such as coding sequences (CDS), mRNA, and other biological elements within replication origins, are also provided. The experimental techniques used for each dataset, as well as other statistical data, are also presented on the web page. Differences in experimental methods, cell lines, and sequencing technologies have resulted in distinct replication origins, making it challenging to differentiate between cell-specific and non-specific replication. We combined multiple replication origins at the species level, scored them, and screened them. The screened regions were considered species-conserved origins. They are integrated and presented as reference replication origins (rORIs) for Homo sapiens, Gallus gallus, Mus musculus, Drosophila melanogaster, and Caenorhabditis elegans. Additionally, we analyzed the genome-wide distribution of genomic elements associated with replication origins, such as CpG islands (CGIs), transcription start sites (TSSs), and G-quadruplexes (G4s). These analyses allow users to select the data they need. DeOri is available at http://tubic.tju.edu.cn/deori/.
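One plausible way to combine origins from several datasets, score them, and screen them is to count how many datasets support each genomic position and keep the regions that reach a minimum support. The sketch below does this for one chromosome. It is an assumption made for illustration, not DeOri's documented scoring scheme: the support-count score, the `min_support` parameter, and the function names are hypothetical, and intervals are treated as half-open (start, end) pairs as in BED files.

```python
from itertools import groupby


def merge_intervals(intervals):
    """Merge overlapping or touching intervals within a single dataset."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def conserved_regions(datasets, min_support=2):
    """Return regions covered by origins from at least min_support datasets.

    datasets: list of lists of half-open (start, end) intervals, one list
    per dataset, all on the same chromosome.
    """
    events = []
    for intervals in datasets:
        # Merge first so each dataset contributes at most 1 to the depth.
        for start, end in merge_intervals(intervals):
            events.append((start, 1))
            events.append((end, -1))
    events.sort()

    regions = []
    depth = 0
    region_start = None
    # Apply all changes at one position together before testing the depth.
    for pos, group in groupby(events, key=lambda event: event[0]):
        depth += sum(delta for _, delta in group)
        if depth >= min_support and region_start is None:
            region_start = pos
        elif depth < min_support and region_start is not None:
            regions.append((region_start, pos))
            region_start = None
    return regions
```

Raising `min_support` makes the screen stricter, trading sensitivity for confidence that a region is shared across experiments rather than specific to one cell line or method.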
MitoSort: Robust Demultiplexing of Pooled Single-cell Genomics Data Using Endogenous Mitochondrial Variants
Multiplexing across donors has emerged as a popular strategy to increase throughput, reduce costs, overcome technical batch effects, and improve doublet detection in single-cell genomic studies. To eliminate additional experimental steps, endogenous nuclear genome variants are used for demultiplexing pooled single-cell RNA sequencing (scRNA-seq) data by several computational tools. However, these tools have limitations when applied to single-cell sequencing methods that do not cover nuclear genomic regions well, such as single-cell assay for transposase-accessible chromatin with sequencing (scATAC-seq). Here, we demonstrate that mitochondrial germline variants are an alternative, robust, and computationally efficient endogenous barcode for sample demultiplexing. We propose MitoSort, a tool that uses mitochondrial germline variants to assign cells to their donor of origin and identify cross-genotype doublets in single-cell genomics datasets. We evaluate its performance by using in silico pooled mitochondrial scATAC-seq (mtscATAC-seq) libraries and experimentally multiplexed data with cell hashtags. MitoSort achieves high accuracy and efficiency in genotype clustering and doublet detection for mtscATAC-seq data, addressing the limitations of current computational techniques tailored for scRNA-seq data. Moreover, MitoSort exhibits versatility and can be applied to various single-cell sequencing approaches beyond mtscATAC-seq, provided the mitochondrial variants are reliably detected. Furthermore, we demonstrate the application of MitoSort in a case study where B cells from eight donors were pooled and assayed by single-cell multi-omics sequencing. Altogether, our results demonstrate the accuracy and efficiency of MitoSort, which enables reliable sample demultiplexing in various single-cell genomic applications. MitoSort is available at https://github.com/tangzhj/MitoSort.
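The principle of using mitochondrial germline variants as an endogenous barcode can be sketched as follows. This is a deliberately simplified illustration, not MitoSort's algorithm: it assumes that each donor's private variants are already known (whereas a real tool must infer genotypes from the pooled data), and the function name, the variant labels, and the 0.2 doublet cutoff are hypothetical.

```python
def assign_cell(alt_reads, donor_variants, doublet_fraction=0.2):
    """Assign one cell to a donor from reads supporting donor-private variants.

    alt_reads: dict mapping variant label -> alternate-allele read count in this cell.
    donor_variants: dict mapping donor name -> set of variants private to that donor.
    Returns a donor name, "doublet", or "unassigned".
    """
    scores = {
        donor: sum(alt_reads.get(variant, 0) for variant in variants)
        for donor, variants in donor_variants.items()
    }
    total = sum(scores.values())
    if total == 0:
        return "unassigned"  # no informative reads in this cell

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # A sizeable share of reads from a second donor suggests two cells
    # from different donors were captured together.
    if len(ranked) > 1 and ranked[1][1] / total >= doublet_fraction:
        return "doublet"
    return ranked[0][0]


# Hypothetical private variants for two pooled donors.
donor_variants = {
    "donor1": {"3010G>A", "7028C>T"},
    "donor2": {"16519T>C"},
}

singlet = assign_cell({"3010G>A": 30, "7028C>T": 25, "16519T>C": 1}, donor_variants)
mixed = assign_cell({"3010G>A": 20, "16519T>C": 18}, donor_variants)
```

Because mitochondrial genomes are present at high copy number per cell, such variants tend to be well covered even in assays with sparse nuclear coverage, which is the motivation given above for using them in scATAC-seq data.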
Aberrant Somatic Hypermutation at Super-enhancer Drives B Cell Lymphoma Transformation
Centromere Landscapes Resolved from Hundreds of Human Genomes
High-fidelity (HiFi) sequencing has facilitated the assembly and analysis of the most repetitive region of the genome, the centromere. Nevertheless, our current understanding of human centromeres is based on a relatively small number of telomere-to-telomere assemblies, which have not yet captured their full diversity. In this study, we investigated the genomic diversity of human centromere higher order repeats (HORs) using both HiFi reads and haplotype-resolved assemblies from hundreds of samples drawn from ongoing pangenome-sequencing projects, which we reprocessed with a novel HOR annotation pipeline, HiCAT-human. We used this wealth of data to provide a global survey of the centromeric HOR landscape; in particular, we found that 23 HORs presented significant copy number variability between populations. We detected three centromere genotypes with unbalanced population frequencies on chromosomes 5, 8, and 17. An inter-assembly comparison of HOR loci further revealed that while HOR array structures are diverse, they nevertheless tend to form a number of specific landscapes, each exhibiting different levels of HOR subunit expansion and possibly reflecting a cyclical evolutionary transition from homogeneous to nested structures and back.
Scm6A: A Fast and Low-cost Method for Quantifying m6A Modifications at the Single-cell Level
It is widely accepted that N6-methyladenosine (m6A) exhibits significant intercellular specificity, which poses challenges for its detection using existing m6A quantitative methods. In this study, we introduced Single-cell m6A Analysis (Scm6A), a machine learning-based approach for single-cell m6A quantification. Scm6A leverages input features derived from the expression levels of m6A trans regulators and cis sequence features, and offers remarkable prediction efficiency and reliability. To further validate the robustness and precision of Scm6A, we first applied Scm6A to single-cell RNA sequencing (scRNA-seq) data from peripheral blood mononuclear cells (PBMCs) and calculated the m6A levels in CD4+ and CD8+ T cells. We also applied a winscore-based m6A calculation method to conduct N6-methyladenosine sequencing (m6A-seq) analysis on CD4+ and CD8+ T cells isolated through magnetic-activated cell sorting (MACS) from the same samples. Notably, the m6A levels calculated by Scm6A exhibited a significant positive correlation with those quantified through m6A-seq in different cells isolated by MACS, providing compelling evidence for Scm6A's reliability. Additionally, we performed single-cell-level m6A analysis on lung cancer tissues as well as blood samples from patients with coronavirus disease 2019 (COVID-19), and demonstrated the landscape and regulatory mechanisms of m6A in different T cell subtypes from these diseases. In summary, Scm6A is a novel, dependable, and accurate method for single-cell m6A detection and has broad applications in the realm of m6A-related research.
Hidden Links Between Skin Microbiome and Skin Imaging Phenome
Although the skin microbiome has been linked to skin health and diseases, its role in modulating human skin appearance remains understudied. Using a total of 1244 face imaging phenomes and 246 cheek metagenomes, we first established three skin age indices by machine learning, including skin phenotype age (SPA), skin microbiota age (SMA), and skin integration age (SIA) as surrogates of phenotypic aging, microbial aging, and their combination, respectively. Moreover, we found that besides aging and gender as intrinsic factors, the skin microbiome might also play a role in shaping skin imaging phenotypes (SIPs). Skin taxonomic and functional α diversity was positively linked to melanin, pore, pigment, and ultraviolet spot levels, but negatively linked to sebum, lightening, and porphyrin levels. Furthermore, certain species were correlated with specific SIPs; for example, sebum and lightening levels were negatively correlated with Corynebacterium matruchotii, Staphylococcus capitis, and Streptococcus sanguinis. Notably, we demonstrated the potential of the skin microbiome for predicting SIPs, among which the lightening level presented the lowest error of 1.8%. Lastly, we provided a reservoir of potential mechanisms through which the skin microbiome may adjust SIPs, including the modulation of pore, wrinkle, and sebum levels by cobalamin and heme synthesis pathways, predominantly driven by Cutibacterium acnes. This pioneering study unveils the hidden links between the skin microbiome and the skin imaging phenome, providing novel insights into how the skin microbiome shapes skin appearance and its healthy aging.
SuperFeat: Quantitative Feature Learning from Single-cell RNA-seq Data Facilitates Drug Repurposing
In this study, we devised a computational framework called Supervised Feature Learning and Scoring (SuperFeat), which enables the training of a machine learning model and evaluates the canonical cellular statuses/features in pathological tissues that underlie the progression of disease. This framework also enables the identification of potential drugs that target the presumed detrimental cellular features. The framework was built on an artificial neural network with gene expression profiles serving as input nodes. The training data comprised single-cell RNA sequencing datasets that encompassed the specific cell lineage during the developmental progression of cell features. Several models of canonical cancer-involved cellular statuses/features were tested with this framework. Finally, we illustrated the drug repurposing pipeline, utilizing the training parameters derived from the adverse cellular statuses/features, which yielded successful validation results both in vitro and in vivo. SuperFeat is accessible at https://github.com/weilin-genomics/rSuperFeat.
Innovative Low-cost Probe Generation Empowers Targeted Long-read RNA Sequencing
EryDB: A Transcriptomic Profile Database for Erythropoiesis and Erythroid-related Diseases
Erythropoiesis is a finely regulated and complex process that involves multiple transformations from hematopoietic stem cells to mature red blood cells at hematopoietic sites from the embryonic to the adult stages. Investigations into its molecular mechanisms have generated a wealth of expression data, including bulk and single-cell RNA sequencing data. A comprehensively integrated and well-curated erythropoiesis-specific database will greatly facilitate the mining of gene expression data and enable large-scale research of erythropoiesis and erythroid-related diseases. Here, we present EryDB, an open-access and comprehensive database dedicated to the collection, integration, analysis, and visualization of transcriptomic data for erythropoiesis and erythroid-related diseases. Currently, the database includes expertly curated quality-assured data of 3803 samples and 1,187,119 single cells derived from 107 public studies of three species (Homo sapiens, Mus musculus, and Danio rerio), nine tissue types, and five diseases. EryDB provides users with the ability to not only browse the molecular features of erythropoiesis between tissues and species, but also perform computational analyses of single-cell and bulk RNA sequencing data, thus serving as a convenient platform for customized queries and analyses. EryDB v1.0 is freely accessible at https://ngdc.cncb.ac.cn/EryDB/home.
NeoTCR: An Immunoinformatic Database of Experimentally-supported Functional Neoantigen-specific TCR Sequences
Neoantigen-based immunotherapy has demonstrated long-lasting antitumor activity. The recognition of neoantigens by T cell receptors (TCRs) is considered a trigger for antitumor responses. Due to the overwhelming number of TCR repertoires in the human genome, pinpointing neoantigen-specific TCRs is a formidable challenge. Recent studies have identified a number of functional neoantigen-specific TCRs, but the corresponding information is scattered across published literature and is difficult to retrieve. To improve access to these data, we developed an immunoinformatic database (NeoTCR) containing a unified description of publicly available neoantigen-specific TCR sequences, as well as relevant information on targeted neoantigens, from experimentally supported studies across 18 cancer subtypes. A user-friendly web interface allows interactive browsing and running of complex database queries. To facilitate rapid identification of neoantigen-specific TCRs from raw sequencing data, NeoTCR offers a one-stop analysis for annotation and visualization of TCR clonotypes, discovery of existing neoantigen-specific TCRs, and exclusion of bystander viral-associated TCRs. NeoTCR represents a unique tool to expedite future studies of neoantigen-specific TCRs and the development of neoantigen-based immunotherapy. NeoTCR is available at http://neotcrdb.bioaimed.com/ and https://github.com/lyotvincent/NeoTCR.