GENOME RESEARCH

Gapless assembly of complete human and plant chromosomes using only nanopore sequencing
Koren S, Bao Z, Guarracino A, Ou S, Goodwin S, Jenike KM, Lucas J, McNulty B, Park J, Rautiainen M, Rhie A, Roelofs D, Schneiders H, Vrijenhoek I, Nijbroek K, Nordesjo O, Nurk S, Vella M, Lawrence KR, Ware D, Schatz MC, Garrison E, Huang S, McCombie WR, Miga KH, Wittenberg AHJ and Phillippy AM
The combination of ultra-long (UL) Oxford Nanopore Technologies (ONT) sequencing reads with long, accurate Pacific Biosciences (PacBio) High Fidelity (HiFi) reads has enabled the completion of a human genome and spurred similar efforts to complete the genomes of many other species. However, this approach for complete, "telomere-to-telomere" genome assembly relies on multiple sequencing platforms, limiting its accessibility. ONT "Duplex" sequencing reads, where both strands of the DNA are read to improve quality, promise high per-base accuracy. To evaluate this new data type, we generated ONT Duplex data for three widely studied genomes: human HG002, Heinz 1706 (tomato), and B73 (maize). For the diploid, heterozygous HG002 genome, we also used "Pore-C" chromatin contact mapping to completely phase the haplotypes. We found the accuracy of Duplex data to be similar to that of HiFi sequencing, but with read lengths tens of kilobases longer, and the Pore-C data to be compatible with existing diploid assembly algorithms. This combination of read length and accuracy enables the construction of a high-quality initial assembly, which can then be further resolved using the UL reads, and finally phased into chromosome-scale haplotypes with Pore-C. The resulting assemblies have a base accuracy exceeding 99.999% (Q50) and near-perfect continuity, with most chromosomes assembled as single contigs. We conclude that ONT sequencing is a viable alternative to HiFi sequencing for de novo genome assembly, and provides a multirun single-instrument solution for the reconstruction of complete genomes.
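The pairing of 99.999% accuracy with Q50 in the abstract above follows the standard Phred scale, Q = -10 log10(error rate). A minimal sketch of that conversion is given here; the function names are illustrative and are not taken from the paper.

```python
import math


def accuracy_to_phred(accuracy: float) -> float:
    """Convert a per-base accuracy (fraction between 0 and 1) to a Phred-scaled Q value."""
    error = 1.0 - accuracy
    if not 0.0 < error <= 1.0:
        raise ValueError("accuracy must be in [0, 1); a perfect 1.0 has no finite Q value")
    return -10.0 * math.log10(error)


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred-scaled Q value back to a per-base accuracy (fraction)."""
    return 1.0 - 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # One error per 100,000 bases corresponds to roughly Q50.
    print(round(accuracy_to_phred(0.99999), 3))
    print(phred_to_accuracy(50))
```

Each additional 10 Q units corresponds to a tenfold drop in the expected error rate, so Q40 is one error per 10,000 bases and Q50 is one per 100,000.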
Factors impacting target-enriched long-read sequencing of resistomes and mobilomes
Slizovskiy IB, Bonin N, Bravo JE, Ferm PM, Singer J, Boucher C and Noyes NR
We investigated the efficiency of target-enriched long-read sequencing (TELSeq) for detecting antimicrobial resistance genes (ARGs) and mobile genetic elements (MGEs) within complex matrices. We aimed to overcome limitations associated with traditional antimicrobial resistance (AMR) detection methods, including short-read shotgun metagenomics, which can lack sensitivity, specificity, and the ability to provide detailed genomic context. By combining biotinylated probe-based enrichment with long-read sequencing, we facilitated the amplification and sequencing of ARGs, eliminating the need for bioinformatic reconstruction. Our experimental design included replicates of human fecal microbiota transplant material, bovine feces, pristine prairie soil, and a mock human gut microbial community, allowing us to examine variables including genomic DNA input and probe set composition. Our findings demonstrated that TELSeq markedly improves the detection rates of ARGs and MGEs compared to traditional sequencing methods, underlining its potential for accurate AMR monitoring. A key insight from our research is the importance of incorporating mobilome profiles to better predict the transferability of ARGs within microbial communities, prompting a recommendation for the use of combined ARG-MGE probe sets for future studies. We also reveal limitations for ARG detection from low-input workflows, and describe the next steps for ongoing protocol refinement to minimize technical variability and expand utility in clinical and public health settings. This effort is part of our broader commitment to advancing methodologies that address the global challenge of AMR.
Multiple paralogues and recombination mechanisms contribute to the high incidence of 22q11.2 Deletion Syndrome
Vervoort L, Dierckxsens N, Sousa Santos M, Meynants S, Souche E, Cools R, Heung T, Devriendt K, Peeters H, McDonald-McGinn D, Swillen A, Breckpot J, Emanuel BS, Van Esch H, Bassett AS and Vermeesch JR
The 22q11.2 deletion syndrome (22q11.2DS) is the most common microdeletion disorder. Why the incidence of 22q11.2DS is much greater than that of other genomic disorders remains unknown. Short-read sequencing cannot resolve the complex segmental duplicon structure to provide direct confirmation of the hypothesis that the rearrangements are caused by nonallelic homologous recombination between the low copy repeats on Chromosome 22 (LCR22s). To enable haplotype-specific assembly and rearrangement mapping in LCR22 regions, we combined fiber-FISH optical mapping with whole-genome (ultra-)long-read sequencing or rearrangement-specific long-range PCR on 24 duos (22q11.2DS patient and parent-of-origin) comprising several different LCR22-mediated rearrangements. Unexpectedly, we demonstrate that not only different paralogous segmental duplicons but also palindromic AT-rich repeats (PATRRs) drive 22q11.2 rearrangements. In addition, we show the existence of two different inversion polymorphisms preceding rearrangement, and somatic mosaicism. The existence of different recombination sites and mechanisms in paralogues and PATRRs, which are expanding in copy number in the human population, is a likely explanation for the high 22q11.2DS incidence.
Construction and evaluation of a new rat reference genome assembly, GRCr8, from long reads and long-range scaffolding
Li K, Smith ML, Blazier JC, Kochan KJ, Wood JMD, Howe K, Kwitek AE, Dwinell MR, Chen H, Ciosek JL, Masterson P, Murphy TD, Kalbfleisch TS and Doris PA
We report the construction and analysis of a new reference genome assembly for Rattus norvegicus, the laboratory rat, a widely used experimental animal model organism. The assembly has been adopted as the rat reference assembly by the Genome Reference Consortium and is named GRCr8. The assembly employed 40× Pacific Biosciences (PacBio) HiFi sequencing coverage and scaffolding using optical mapping and Hi-C. We used genomic DNA from a male BN/NHsdMcwi (BN) rat of the same strain and from the same colony as the prior reference assembly, mRatBN7.2. The assembly is at chromosome level with 98.7% of the sequence assigned to chromosomes. All chromosomes have increased in size compared with the prior assembly, and k-mer analysis indicates that the subject animal is fully inbred and that the genome is represented as a single haploid assembly. Notable increases are observed in Chromosomes 3, 11, and 12 in the prospective rDNA regions. In addition, Chr Y has increased threefold in size and is more consistent with the rat karyotype than previous assemblies. Several other chromosomes have grown by the incorporation of sizable discrete new blocks. These contain highly repetitive sequences and encode numerous previously unannotated genes. In addition, centromeric sequences are incorporated in most chromosomes. Genome annotation has been performed by NCBI RefSeq, which confirms improvement in assembly quality and adds more than 1100 new protein coding genes. PacBio Iso-Seq data have been acquired from multiple tissues of the subject animal and are released concurrently with the new assembly to aid further analyses.
ISWI1 complex proteins facilitate developmental genome editing in Paramecium
Singh A, Häußermann L, Emmerich C, Nischwitz E, Seah BK, Butter F, Nowacki M and Swart EC
One of the most extensive forms of natural genome editing occurs in ciliates, a group of microbial eukaryotes. Ciliate germline and somatic genomes are contained in distinct nuclei within the same cell. During the massive reorganization process of somatic genome development, ciliates eliminate tens of thousands of DNA sequences from a germline genome copy. Recently, we showed that the chromatin remodeler ISWI1 is required for somatic genome development in the ciliate Paramecium tetraurelia. Here, we describe two highly similar paralogous proteins, ICOPa and ICOPb, that are essential for its genome editing. ICOPa and ICOPb are highly divergent from known proteins; the only domain detected showed distant homology to the WSD (WHIM2+WHIM3) motif. We show that both ICOPa and ICOPb interact with the chromatin remodeler ISWI1. Upon ICOP knockdown, changes in alternative DNA excision boundaries and nucleosome densities are similar to those observed for ISWI1 knockdown. We thus propose that a complex comprising ISWI1 and either or both of ICOPa and ICOPb is needed for precise genome editing.
Advancements in prospective single-cell lineage barcoding and their applications in research
Zhang X, Huang Y, Yang Y, Wang QE and Li L
Single-cell lineage tracing (scLT) has emerged as a powerful tool, providing unparalleled resolution to investigate cellular dynamics, fate determination, and the underlying molecular mechanisms. This review thoroughly examines the latest prospective lineage DNA barcode tracing technologies. It further highlights pivotal studies that leverage single-cell lentiviral integration barcoding technology to unravel the dynamic nature of cell lineages in both developmental biology and cancer research. Additionally, the review navigates through critical considerations for successful experimental design in lineage tracing and addresses challenges inherent in this field, including technical limitations, complexities in data analysis, and the imperative for standardization. It also outlines current gaps in knowledge and suggests future research directions, contributing to the ongoing advancement of scLT studies.
Haplotype-resolved genome and population genomics of the threatened garden dormouse in Europe
Byerly PA, von Thaden A, Leushkin E, Hilgers L, Liu S, Winter S, Schell T, Gerheim C, Ben Hamadou A, Greve C, Betz C, Bolz HJ, Büchner S, Lang J, Meinig H, Famira-Parcsetich EM, Stubbe SP, Mouton A, Bertolino S, Verbeylen G, Briner T, Freixas L, Vinciguerra L, Mueller SA, Nowak C and Hiller M
Genomic resources are important for evaluating genetic diversity and supporting conservation efforts. The garden dormouse (Eliomys quercinus) is a small rodent that has experienced one of the most severe modern population declines in Europe. We present a high-quality haplotype-resolved reference genome for the garden dormouse, and combine comprehensive short- and long-read transcriptomics data sets with homology-based methods to generate a highly complete gene annotation. Demographic history analysis of the genome reveals a sharp population decline since the last interglacial, indicating an association between colder climates and population declines before anthropogenic influence. Using our genome and genetic data from 100 individuals, largely sampled in a citizen-science project across the contemporary range, we conduct the first population genomic analysis for this species. We find clear evidence for population structure across the species' core Central European range. Notably, our data show that the Alpine population, characterized by strong differentiation and reduced genetic diversity, is reproductively isolated from other regions and likely represents a differentiated evolutionarily significant unit (ESU). The predominantly declining Eastern European populations also show signs of recent isolation, a pattern consistent with a range expansion from Western to Eastern Europe during the Holocene, leaving relict populations now facing local extinction. Overall, our findings suggest that garden dormouse conservation may be enhanced in Europe through the designation of ESUs.
High-quality sika deer omics data and integrative analysis reveal genic and cellular regulation of antler regeneration
Li Z, Xu Z, Zhu L, Qin T, Ma J, Feng Z, Yue H, Guan Q, Zhou B, Han G, Zhang G, Li C, Jia S, Qiu Q, Hao D, Wang Y and Wang W
The antler is the only mammalian organ that can fully regenerate annually. However, the regulatory pattern and mechanism of gene expression and cell differentiation during this process remain largely unknown. Here, we obtain a comprehensive assembly and gene annotation of the sika deer (Cervus nippon) genome. Together with large-scale chromatin accessibility and gene expression data, we construct gene regulatory networks involved in antler regeneration, identifying four transcription factors, , , , and with high regulatory activity across the whole regeneration process. Comparative studies and a luciferase reporter assay suggest the expression driven by a cervid-specific regulatory element might be important for antler regenerative ability. We further develop a model called cTOP, which integrates single-cell data with bulk regulatory networks, and find , , , and as potential pivotal factors in antler stem cell activation and osteogenic differentiation. Additionally, we uncover interactions within and between cell programs and pathways during the regeneration process. These findings provide insights into the gene and cell regulatory mechanisms of antler regeneration, particularly in stem cell activation and differentiation.
Understanding isoform expression by pairing long-read sequencing with single-cell and spatial transcriptomics
Belchikov N, Hsu J, Li XJ, Jarroux J, Hu W, Joglekar A and Tilgner HU
RNA isoform diversity, produced via alternative splicing and alternative usage of transcription start and poly(A) sites, results in varied transcripts being derived from the same gene. Distinct isoforms can play important biological roles, including by changing the sequences or expression levels of protein products. The first single-cell approaches to RNA sequencing (and later, spatial approaches), which are now widely used for the identification of differentially expressed genes, rely on short reads and offer the ability to transcriptomically compare different cell types but are limited in their ability to measure differential isoform expression. More recently, long-read sequencing methods have been combined with single-cell and spatial technologies in order to characterize isoform expression. In this review, we provide an overview of the emergence of single-cell and spatial long-read sequencing and discuss the challenges associated with the implementation of these technologies and interpretation of these data. We discuss the opportunities they offer for understanding the relationships between the distinct variable elements of transcript molecules and highlight some of the ways in which they have been used to characterize isoforms' roles in development and pathology. Single-nucleus long-read sequencing, a special case of the single-cell approach, is also discussed. We attempt to cover both the limitations of these technologies and their significant potential for expanding our still-limited understanding of the biological roles of RNA isoforms.
Modeling gene interactions in polygenic prediction via geometric deep learning
Li H, Zeng J, Snyder MP and Zhang S
The polygenic risk score (PRS) is a widely used approach for predicting individuals' genetic risk of complex diseases, playing a pivotal role in advancing precision medicine. Traditional PRS methods, predominantly following a linear structure, often fall short in capturing the intricate relationships between genotype and phenotype. In this study, we present PRS-Net, an interpretable geometric deep learning-based framework that effectively models the nonlinearity of biological systems for enhanced disease prediction and biological discovery. PRS-Net begins by deconvoluting the genome-wide PRS at single-gene resolution, and then explicitly encapsulates gene-gene interactions using a graph neural network (GNN) for genetic risk prediction, enabling a systematic characterization of the molecular interplay underpinning diseases. An attentive readout module is introduced to facilitate model interpretation. Extensive tests across multiple complex traits and diseases demonstrate the superior prediction performance of PRS-Net compared to conventional PRS methods. The interpretability of PRS-Net further enhances the identification of disease-relevant genes and gene programs. PRS-Net provides a potent tool for concurrent genetic risk prediction and biological discovery for complex diseases.
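For orientation, the "linear structure" of traditional PRS methods mentioned above is usually a weighted sum of allele dosages. The sketch below illustrates only that conventional baseline, not PRS-Net itself; the function name and the toy numbers are hypothetical.

```python
from typing import Sequence


def linear_prs(dosages: Sequence[float], effect_sizes: Sequence[float]) -> float:
    """Conventional linear PRS: sum over variants of effect size times allele dosage.

    dosages      -- copies of the effect allele per variant (0, 1, or 2, or imputed fractions)
    effect_sizes -- per-variant weights, e.g. GWAS effect estimates (assumed to be given)
    """
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    return sum(beta * g for beta, g in zip(effect_sizes, dosages))


if __name__ == "__main__":
    # Toy individual with three variants; the values are made up for illustration.
    print(linear_prs([0, 1, 2], [0.1, -0.2, 0.05]))
```

Because each variant contributes independently to this sum, interactions between genes cannot be represented, which is the limitation the abstract refers to.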
Challenges in identifying mRNA transcript starts and ends from long-read sequencing data
Calvo-Roitberg E, Daniels RF and Pai AA
Long-read sequencing (LRS) technologies have the potential to revolutionize scientific discoveries in RNA biology through the comprehensive identification and quantification of full-length mRNA isoforms. Despite great promise, challenges remain in the widespread implementation of LRS technologies for RNA-based applications, including concerns about low coverage, high sequencing error, and the lack of robust computational pipelines. Although much focus has been placed on defining mRNA exon composition and structure with LRS data, the ability to assess the terminal ends of isoforms, specifically transcription start and end sites, has been characterized less carefully. Such characterization is crucial for completely delineating full mRNA molecules and their regulatory consequences. However, there are substantial inconsistencies in both start and end coordinates of LRS reads spanning a gene, such that LRS reads often fail to accurately recapitulate annotated or empirically derived terminal ends of mRNA molecules. Here, we describe the specific challenges of identifying and quantifying mRNA terminal ends with LRS technologies and how these issues influence biological interpretations of LRS data. We then review recent experimental and computational advances designed to alleviate these problems, with ideal use cases for each approach. Finally, we outline anticipated developments and necessary improvements for the characterization of terminal ends from LRS data.
Leveraging the power of long reads for targeted sequencing
Iyer SV, Goodwin S and McCombie WR
Long-read sequencing technologies have improved the contiguity and, as a result, the quality of genome assemblies by generating reads long enough to span and resolve complex or repetitive regions of the genome. Several groups have shown the power of long reads in detecting thousands of genomic and epigenomic features that were previously missed by short-read sequencing approaches. While these studies demonstrate how long reads can help resolve repetitive and complex regions of the genome, they also highlight the throughput and coverage requirements needed to accurately resolve variant alleles across large populations using these platforms. At the time of this review, whole-genome long-read sequencing is more expensive than short-read sequencing on the highest throughput short-read instruments; thus, achieving sufficient coverage to detect low-frequency variants (such as somatic variation) in heterogeneous samples remains challenging. Targeted sequencing, on the other hand, provides the depth necessary to detect these low-frequency variants in heterogeneous populations. Here, we review currently used and recently developed targeted sequencing strategies that leverage existing long-read technologies to increase the resolution with which we can look at nucleic acids in a variety of biological contexts.
The chromatin landscape of the histone-possessing bacteria
Marinov GK, Doughty B, Kundaje A and Greenleaf WJ
Histone proteins have traditionally been thought to be restricted to eukaryotes and most archaea, with eukaryotic nucleosomal histones deriving from their archaeal ancestors. In contrast, bacteria lack histones as a rule. However, histone proteins have recently been identified in a few bacterial clades, most notably the phylum Bdellovibrionota, and these histones have been proposed to exhibit a range of divergent features compared to histones in archaea and eukaryotes. However, no functional genomic studies of the properties of Bdellovibrionota chromatin have been carried out. In this work, we map the landscape of chromatin accessibility, active transcription and three-dimensional genome organization in a member of Bdellovibrionota (a strain). We find that, similar to what is observed in some archaea and in eukaryotes with compact genomes such as yeast, chromatin is characterized by preferential accessibility around promoter regions. Similar to eukaryotes, chromatin accessibility in positively correlates with gene expression. Mapping active transcription through single-strand DNA (ssDNA) profiling revealed that unlike in yeast, but similar to the state of mammalian and fly promoters, promoters exhibit very strong polymerase pausing. Finally, similar to that of other bacteria without histones, the genome exists in a three-dimensional (3D) configuration organized by the parABS system along the axis defined by replication origin and termination regions. These results provide a foundation for understanding the chromatin biology of the unique Bdellovibrionota bacteria and the functional diversity in chromatin organization across the tree of life.
Leveraging the T2T assembly to resolve rare and pathogenic inversions in reference genome gaps
Bilgrav Saether K, Eisfeldt J, Bengtsson JD, Lun MY, Grochowski CM, Mahmoud M, Chao HT, Rosenfeld JA, Liu P, Ek M, Schuy J, Ameur A, Dai H, , Hwang JP, Sedlazeck FJ, Bi W, Marom R, Wincent J, Nordgren A, Carvalho CMB and Lindstrand A
Chromosomal inversions (INVs) are particularly challenging to detect due to their copy-number neutral state and association with repetitive regions. Inversions represent about 1/20 of all balanced structural chromosome aberrations and can lead to disease by gene disruption or altering regulatory regions of dosage-sensitive genes in cis. Short-read genome sequencing (srGS) can only resolve ∼70% of cytogenetically visible inversions referred to clinical diagnostic laboratories, likely due to breakpoints in repetitive regions. Here, we study 12 inversions by long-read genome sequencing (lrGS) (n = 9) or srGS (n = 3) and resolve nine of them. In four cases, the inversion breakpoint region was missing from at least one of the human reference genomes (GRCh37, GRCh38, T2T-CHM13) and a reference-agnostic analysis was needed. One of these cases, an INV9 mappable only in de novo assembled lrGS data using T2T-CHM13, disrupts EHMT1, consistent with a Mendelian diagnosis (Kleefstra syndrome 1; MIM#610253). Next, by pairwise comparison between T2T-CHM13, GRCh37, and GRCh38, as well as the chimpanzee and bonobo, we show that hundreds of megabases of sequence are missing from at least one human reference, highlighting that primate genomes contribute to genomic diversity. Aligning population genomic data to these regions indicated that these regions are variable between individuals. Our analysis emphasizes that T2T-CHM13 is necessary to maximize the value of lrGS for optimal inversion detection in clinical diagnostics. These results highlight the importance of leveraging diverse and comprehensive reference genomes to resolve unsolved molecular cases in rare diseases.
Multisample motif discovery and visualization for tandem repeats
Zhang Y, Hulsman M, Salazar A, Tesi NO, Knoop L, van der Lee S, Wijesekera S, Krizova J, Kamsteeg EJ and Holstege H
Tandem repeats (TRs) occupy a significant portion of the human genome and are a source of polymorphism due to variations in sizes and motif compositions. Some of these variations have been associated with various neuropathological disorders, highlighting the clinical importance of assessing the motif structure of TRs. Moreover, assessing the TR motif variation can offer valuable insights into evolutionary dynamics and population structure. Previously, characterizations of TRs have been limited by short-read sequencing technology, which lacks the ability to accurately capture the full TR sequences. As long-read sequencing becomes more accessible and can capture the full complexity of TRs, there is now also a need for tools to characterize and analyze TRs using long-read data across multiple samples. In this study, we present MotifScope, a novel algorithm for characterization and visualization of TRs based on a de novo k-mer approach for motif discovery. Comparative analysis against established tools reveals that MotifScope can identify a greater number of motifs and more accurately represent the underlying repeat sequence. Moreover, MotifScope has been specifically designed to enable motif composition comparisons across assemblies of different individuals, as well as across long-read sequencing reads within an individual, through combined motif discovery and sequence alignment. We showcase potential applications of MotifScope in diverse fields, including population genetics, clinical settings, and forensic analyses.
Resolving complex duplication variants in autism spectrum disorder using long-read genome sequencing
Eisfeldt J, Higginbotham EJ, Lenner F, Howe J, Fernandez BA, Lindstrand A, Scherer SW and Feuk L
Rare or de novo structural variation, primarily in the form of copy number variants, is detected in 5%-10% of autism spectrum disorder (ASD) families. While complex structural variants involving duplications can generally be detected using microarray or short-read genome sequencing (GS), these methods frequently fail to characterize breakpoints at nucleotide resolution, requiring additional molecular methods for validation and fine-mapping. Here, we use Oxford Nanopore Technologies PromethION long-read GS to characterize complex genomic rearrangements (CGRs) involving large duplications that segregate with ASD in five families. In total, we investigated 13 CGR carriers and were able to resolve all breakpoint junctions at nucleotide resolution. While all breakpoints were identified, the precise genomic architecture of one rearrangement remained unresolved with three different potential structures. The findings in two families include potential fusion genes formed through duplication rearrangements, involving and In two of the families originating from the same geographical region, an identical rearrangement involving was identified, which likely represents a founder variant. In addition, we analyze methylation status directly from the long-read data, allowing us to assess the activity of rearranged genes and regulatory regions. Investigation of methylation across the CGRs reveals aberrant methylation status in carriers across a rearrangement affecting the locus. In aggregate, our results demonstrate the utility of nanopore sequencing to pinpoint CGRs associated with ASD in five unrelated families, and highlight the importance of a gene-centric description of disease-associated complex chromosomal rearrangements.
KAS-ATAC reveals the genome-wide single-stranded accessible chromatin landscape of the human genome
Kim SH, Marinov GK and Greenleaf W
Gene regulation in most eukaryotes involves two fundamental physical processes: alterations in the packaging of the genome by nucleosomes, with active cis-regulatory elements (CREs) generally characterized by an open-chromatin configuration, and the activation of transcription. Mapping these physical properties and biochemical activities genome-wide, through profiling chromatin accessibility and active transcription, is a key tool used to understand the logic and mechanisms of transcription and its regulation. However, the relationship between these two states has until now not been accessible to simultaneous measurement. To address this, we developed KAS-ATAC, a combination of the KAS-seq (Kethoxal-Assisted SsDNA sequencing) and ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) methods for mapping single-stranded DNA (and thus active transcription) and chromatin accessibility, respectively, enabling the genome-wide identification of DNA fragments that are simultaneously accessible and contain ssDNA. We use KAS-ATAC to evaluate levels of active transcription over different classes of regulatory elements in the human genome, to estimate the absolute levels of transcribed accessible DNA over CREs, to map the nucleosomal configurations associated with RNA polymerase activities, and to assess transcription factor association with transcribed DNA through transcription factor binding site (TFBS) footprinting. We observe lower levels of transcription over distal enhancers compared to promoters and distinct nucleosomal configurations around transcription initiation sites associated with active transcription. Most TFs associate equally with transcribed and nontranscribed DNA but a few factors specifically do not exhibit footprints over ssDNA-containing fragments. We anticipate that KAS-ATAC will continue to yield useful insights into chromatin organization and transcriptional regulation in other contexts in the future.
A national long-read sequencing study on chromosomal rearrangements uncovers hidden complexities
Eisfeldt J, Ameur A, Lenner F, Ten Berk de Boer E, Ek M, Wincent J, Vaz R, Ottosson J, Jonson T, Ivarsson S, Thunström S, Topa A, Stenberg S, Rohlin A, Sandestig A, Nordling M, Palmebäck P, Burstedt M, Nordin F, Stattin EL, Sobol M, Baliakas P, Bondeson ML, Höijer I, Saether KB, Lovmar L, Ehrencrona H, Melin M, Feuk L and Lindstrand A
Clinical genetic laboratories often require a comprehensive analysis of chromosomal rearrangements/structural variants (SVs), from large events like translocations and inversions to supernumerary ring/marker chromosomes and small deletions or duplications. Understanding the complexity of these events and their clinical consequences requires pinpointing breakpoint junctions and resolving the derivative chromosome structure. This task often surpasses the capabilities of short-read sequencing technologies. In contrast, long-read sequencing techniques present a compelling alternative for clinical diagnostics. Here, Genomic Medicine Sweden-Rare Diseases has explored the utility of HiFi Revio long-read genome sequencing (lrGS) for digital karyotyping of SVs nationwide. The 16 samples from 13 families were collected from all Swedish healthcare regions. Prior investigations had identified 16 SVs, ranging from simple to complex rearrangements, including inversions, translocations, and copy number variants. We have established a national pipeline and a shared variant database for variant calling and filtering. Using lrGS, 14 of the 16 known SVs are detected. Of these, 13 are mapped at nucleotide resolution, and one complex rearrangement is only visible by read depth. Two Chromosome 21 rearrangements, one mosaic, remain undetected. Average read lengths are 8.3-18.8 kb with coverage exceeding 20× for all samples. De novo assembly results in a limited number of phased contigs per individual (N50 6-86 Mb), enabling direct characterization of the chromosomal rearrangements. In a national pilot study, we demonstrate the utility of HiFi Revio lrGS for analyzing chromosomal rearrangements. Based on our results, we propose a 5-year plan to expand lrGS use for rare disease diagnostics in Sweden.
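The contig N50 values quoted above (6-86 Mb) use the usual assembly statistic: the length of the contig at which the cumulative length, counted from the longest contig downward, first reaches half of the total assembly length. A minimal sketch of that calculation follows; the function name and the example lengths are illustrative and do not come from the study.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of the
    contig whose addition first brings the running total to at least half of
    the summed length of all contigs.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:  # integer comparison avoids float rounding
            return length
    raise AssertionError("unreachable: the running total always reaches the total")


if __name__ == "__main__":
    # Toy contig lengths in Mb, made up for illustration.
    print(n50([80, 70, 50, 40, 30, 20]))
```

A higher N50 means that more of the assembly sits in long contigs, which is why it is commonly reported as a contiguity measure.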
Long-read RNA sequencing reveals allele-specific N6-methyladenosine modifications
Park D and Cenik C
Long-read sequencing technology enables highly accurate detection of allele-specific RNA expression, providing insights into the effects of genetic variation on splicing and RNA abundance. Furthermore, the ability to directly sequence RNA using the Oxford Nanopore technology promises the detection of RNA modifications in tandem with ascertaining the allelic origin of each molecule. Here, we leverage these advantages to determine allele-biased patterns of N6-methyladenosine (m6A) modifications in native mRNA. We utilized human and mouse cells with known genetic variants to assign allelic origin of each mRNA molecule combined with a supervised machine learning model to detect read-level m6A modification ratios. Our analyses revealed the importance of sequences adjacent to the DRACH-motif in determining m6A deposition, in addition to allelic differences that directly alter the motif. Moreover, we discovered allele-specific m6A modification (ASM) events with no genetic variants in close proximity to the differentially modified nucleotide, demonstrating the unique advantage of using long reads and surpassing the capabilities of antibody-based short-read approaches. This technological advancement promises to advance our understanding of the role of genetics in determining mRNA modifications.
Geometric deep learning framework for de novo genome assembly
Vrček L, Bresson X, Laurent T, Schmitz M, Kawaguchi K and Šikić M
The critical stage of every de novo genome assembler is identifying paths in assembly graphs that correspond to the reconstructed genomic sequences. The existing algorithmic methods struggle with this, primarily due to repetitive regions causing complex graph tangles, leading to fragmented assemblies. Here, we introduce GNNome, a framework for path identification based on geometric deep learning that enables training models on assembly graphs without relying on existing assembly strategies. By leveraging only the symmetries inherent to the problem, GNNome reconstructs assemblies from PacBio HiFi reads with contiguity and quality comparable to those of the state-of-the-art tools across several species. With every new genome assembled telomere-to-telomere, the amount of reliable training data at our disposal increases. Combining the straightforward generation of abundant simulated data for diverse genomic structures with the AI approach makes the proposed framework a plausible cornerstone for future work on reconstructing complex genomes with different ploidy and aneuploidy degrees. To facilitate such developments, we make the framework and the best-performing model publicly available, provided as a tool that can directly be used to assemble new haploid genomes.
The chromatin tapestry as a framework for neurodevelopment
Nolan B, Reznicek TE, Cummings CT and Rowley MJ
The neuronal nucleus houses a meticulously organized genome. Within this structure, genetic material is not simply compacted but arranged into a precise and functional 3D chromatin landscape essential for cellular regulation. This mini-review highlights the importance of this chromatin landscape in healthy neurodevelopment, as well as the diseases that occur with aberrant chromatin architecture. We discuss insights into the fundamental mechanistic relationship between histone modifications, DNA methylation, and genome organization. We then discuss findings that reveal how these epigenetic features change throughout normal neurodevelopment. Finally, we highlight single-gene neurodevelopmental disorders that illustrate the interdependence of epigenetic features, showing how disruptions in DNA methylation or genome architecture can ripple across the entire epigenome. As such, we emphasize the importance of measuring multiple chromatin architectural aspects, as the disruption of one mechanism can likely impact others in the intricate epigenetic network. This mini-review underscores the vast gaps in our understanding of chromatin structure in neurodevelopmental diseases and the substantial research needed to understand the interplay between chromatin features and neurodevelopment.