Molecular Ecology Resources

Correction to "Characterisation of Putative Circular Plasmids in Sponge-Associated Bacterial Communities Using a Selective Multiply-Primed Rolling Circle Amplification"
Probe Capture Enrichment Sequencing of amoA Genes Improves the Detection of Diverse Ammonia-Oxidising Archaeal and Bacterial Populations
Hiraoka S, Ijichi M, Takeshima H, Kumagai Y, Yang CC, Makabe-Kobayashi Y, Fukuda H, Yoshizawa S, Iwasaki W, Kogure K and Shiozaki T
The ammonia monooxygenase subunit A (amoA) gene has been used to investigate the phylogenetic diversity, spatial distribution and activity of ammonia-oxidising archaea (AOA) and bacteria (AOB), which contribute significantly to the nitrogen cycle in various ecosystems. Amplicon sequencing of amoA is a widely used method; however, it produces inaccurate results owing to the lack of a 'universal' primer set. Moreover, currently available primer sets suffer from amplification biases, which can lead to severe misinterpretation. Although shotgun metagenomic and metatranscriptomic analyses are alternative approaches without amplification bias, the low abundance of target genes in heterogeneous environmental DNA restricts a comprehensive analysis to a realisable sequencing depth. In this study, we developed a probe set and bioinformatics workflow for amoA enrichment sequencing using a hybridisation capture technique. Using metagenomic mock community samples, our approach effectively enriched amoA genes with low compositional changes, outperforming amplification and meta-omics sequencing analyses. Following the analysis of metatranscriptomic marine samples, we predicted 80 operational taxonomic units (OTUs) assigned to either AOA or AOB, of which 30 OTUs were unidentified using simple metatranscriptomic or amoA gene amplicon sequencing. Mapped read ratios to all the detected OTUs were significantly higher for the capture samples (50.4 ± 27.2%) than for non-capture samples (0.05 ± 0.02%), demonstrating the high enrichment efficiency of the method. The analysis also revealed the spatial diversity of AOA ecotypes with high sensitivity and phylogenetic resolution, which are difficult to examine using conventional approaches.
OGU: A Toolbox for Better Utilising Organelle Genomic Data
Wu P, Xue N, Yang J, Zhang Q, Sun Y and Zhang W
Organelle genomes serve as crucial datasets for investigating the genetics and evolution of plants and animals, genome diversity, and species identification. To enhance the collection, analysis, and visualisation of such data, we have developed a novel open-source software tool named Organelle Genome Utilities (OGU). The software encompasses three modules designed to streamline the handling of organelle genome data. The data collection module is dedicated to retrieving, validating and organising sequence information. The evaluation module assesses sequence variance using a range of methods, including novel metrics termed stem and terminal phylogenetic diversity. The primer module designs universal primers for downstream applications. Finally, a visualisation pipeline has been developed to present comprehensive insights into organelle genomes across different lineages rather than focusing solely on individual species. The performance, compatibility and stability of OGU have been rigorously evaluated through benchmarking with four datasets, including one million mixed GenBank records, plastid genomic data from the Lamiaceae family, mitochondrial data from rodents, and 308 plastid genomes sourced from various angiosperm families. Based on software capabilities, we identified 30 plastid intergenic spacers. These spacers exhibit a moderate evolutionary rate and offer practical utility comparable to coding regions, highlighting the potential applications of intergenic spacers in organelle genomes. We anticipate that OGU will substantially enhance the efficient utilisation of organelle genomic data and broaden the prospects for related research endeavours.
HMicroDB: A Comprehensive Database of Herpetofaunal Microbiota With a Focus on Host Phylogeny, Physiological Traits, and Environment Factors
Li J, Gao Y, Shu G, Chen X, Zhu J, Zheng S and Chen T
Symbiotic microbiota strongly impact host physiology. Amphibians and reptiles occupy a pivotal role in the evolutionary history of Animalia, and they are of significant ecological, economic, and scientific value. Many prior studies have found that symbiotic microbiota in herpetofaunal species are closely associated with host phylogeny, physiological traits, and environmental factors; however, insufficient integrated databases hinder researchers from querying, accessing, and reanalyzing these resources. To rectify this, we built the first herpetofaunal microbiota database (HMicroDB; https://herpdb.com/) that integrates 11,697 microbiological samples from 337 host species (covering 23 body sites and associated with 23 host phenotypic or environmental factors), and we identified 11,084 microbial taxa by consistent annotation. The standardised analysis process, cross-dataset integration, user-friendly interface, and interactive visualisation make the HMicroDB a powerful resource for researchers to search, browse, and explore the relationships between symbiotic microbiota, hosts, and environment. This facilitates research in host-microbiota coevolution, biological conservation, and resource utilisation.
Development of SNP Panels from Low-Coverage Whole Genome Sequencing (lcWGS) to Support Indigenous Fisheries for Three Salmonid Species in Northern Canada
Beemelmanns A, Bouchard R, Michaelides S, Normandeau E, Jeon HB, Chamlian B, Babin C, Hénault P, Perrot O, Harris LN, Zhu X, Fraser D, Bernatchez L and Moore JS
Single nucleotide polymorphism (SNP) panels are powerful tools for assessing the genetic population structure and dispersal of fishes and can enhance management practices for commercial, recreational and subsistence mixed-stock fisheries. Arctic Char (Salvelinus alpinus), Brook Trout (Salvelinus fontinalis) and Lake Whitefish (Coregonus clupeaformis) are among the most harvested and consumed fish species in Northern Indigenous communities in Canada, contributing significantly to food security, culture, tradition and economy. However, genetic resources supporting Indigenous fisheries have not been widely accessible to northern communities (e.g. Inuit, Cree, Dene). Here, we developed Genotyping-in-Thousands by sequencing (GT-seq) panels for population assignment and mixed-stock analyses of three salmonids, to support fisheries stewardship or co-management in Northern Canada. Using low-coverage Whole Genome Sequencing data from 418 individuals across source populations in Cambridge Bay (Nunavut), Great Slave Lake (Northwest Territories), James Bay (Québec) and Mistassini Lake (Québec), we developed a bioinformatic SNP filtering workflow to select informative SNP markers from genotype likelihoods. These markers were then used to design GT-seq panels, thus enabling high-throughput genotyping for these species. The three GT-seq panels yielded an average of 413 autosomal loci and were validated using 525 individuals with an average assignment accuracy of 83%. Thus, these GT-seq panels are powerful tools for assessing population structure and quantifying the relative contributions of populations/stocks in mixed-stock fisheries across multiple regions. Interweaving genomic data derived from these tools with Traditional Ecological Knowledge will ensure the sustainable harvest of three culturally important salmonids in Indigenous communities, contributing to food security programmes and the economy in Northern Canada.
Chromosomal-Level Genome Suggests Adaptive Constraints Leading to the Historical Population Decline in an Extremely Endangered Plant
Shao S, Li Y, Feng X, Jin C, Liu M, Zhu R, Tracy ME, Guo Z, He Z, Shi S and Xu S
Increased human activity and climate change have significantly impacted wild habitats and increased the number of endangered species. Exploring evolutionary history and predicting adaptive potential using genomic data will facilitate species conservation and biodiversity recovery. Here, we examined the genome evolution of a critically endangered tree Pellacalyx yunnanensis, a plant species with extremely small populations (PSESP) that is narrowly distributed in Xishuangbanna, China. The species has neared extinction due to economic exploitation in recent decades. We assembled a chromosome-level genome of 334 Mb, with the N50 length of 20.5 Mb. Using the genome, we discovered that P. yunnanensis has undergone several population size reductions, leading to excess deleterious mutations. The species may possess low adaptive potential due to reduced genetic diversity and the loss of stress-responsive genes. We estimate that P. yunnanensis is the basal species of its genus and diverged from its relatives during global cooling, suggesting it was stranded in unsuitable environments during periods of dramatic climate change. In particular, the loss of seed dormancy leads to germination under unfavourable conditions and reproduction challenges. This dormancy loss may have occurred through genetic changes that suppress ABA signalling and the loss of genes involved in seed maturation. The high-quality genome has also enabled us to reveal phenotypic trait evolution in Rhizophoraceae and identify divergent adaptation to intertidal and inland habitats. In summary, our study elucidates mechanisms underlying the decline and evaluates the adaptive potential of P. yunnanensis to future climate change, informing future conservation efforts.
Revisiting the Briggs Ancient DNA Damage Model: A Fast Maximum Likelihood Method to Estimate Post-Mortem Damage
Zhao L, Henriksen RA, Ramsøe A, Nielsen R and Korneliussen TS
One essential initial step in the analysis of ancient DNA is to authenticate that the DNA sequencing reads are actually from ancient DNA. This is done by assessing if the reads exhibit typical characteristics of post-mortem damage (PMD), including cytosine deamination and nicks. We present a novel statistical method implemented in a fast multithreaded programme, ngsBriggs, that enables rapid quantification of PMD by estimation of the Briggs ancient damage model parameters (Briggs parameters). Using a multinomial model with maximum likelihood fit, ngsBriggs accurately estimates the parameters of the Briggs model, quantifying the PMD signal from single and double-stranded DNA regions. We extend the original Briggs model to capture PMD signals for contemporary sequencing platforms and show that ngsBriggs accurately estimates the Briggs parameters across a variety of contamination levels. Classification of reads into ancient or modern reads, for the purpose of decontamination, is significantly more accurate using ngsBriggs than using other available methods. Furthermore, ngsBriggs is substantially faster than other state-of-the-art methods. ngsBriggs offers a practical and accurate method for researchers seeking to authenticate ancient DNA and improve the quality of their data.
Sediment Core DNA-Metabarcoding and Chitinous Remain Identification: Integrating Complementary Methods to Characterise Chironomidae Biodiversity in Lake Sediment Archives
Blattner LA, Lapellegerie P, Courtney-Mustaphi C and Heiri O
Chironomidae, so-called non-biting midges, are considered key bioindicators of aquatic ecosystem variability. Data derived from morphologically identifying their chitinous remains in sediments document chironomid larvae assemblages, which are studied to reconstruct ecosystem changes over time. Recent developments in sedimentary DNA (sedDNA) research have demonstrated that molecular techniques are suitable for determining past and present occurrences of organisms. Nevertheless, sedDNA records documenting alterations in chironomid assemblages remain largely unexplored. To close this gap, we examined the applicability of sedDNA metabarcoding to identify Chironomidae assemblages in lake sediments by sampling and processing three 21-35 cm long sediment cores from Lake Sempach in Switzerland. With a focus on developing analytical approaches, we compared an invertebrate-universal (FWH) and a newly designed Chironomidae-specific metabarcoding primer set (CH) to assess their performance in detecting Chironomidae DNA. We isolated and identified chitinous larval remains and compared the morphotype assemblages with the data derived from sedDNA metabarcoding. Results showed a good overall agreement of the morphotype assemblage-specific clustering among the chitinous remains and the metabarcoding datasets. Both methods indicated higher chironomid assemblage similarity between the two littoral cores in contrast to the deep lake core. Moreover, we observed a pronounced primer bias effect resulting in more Chironomidae detections with the CH primer combination compared to the FWH combination. Overall, we conclude that sedDNA metabarcoding can supplement traditional remain identifications and potentially provide independent reconstructions of past chironomid assemblage changes. Furthermore, it has the potential of more efficient workflows, better sample standardisation and species-level resolution datasets.
Genotyping Error Detection and Customised Filtration for SNP Datasets
Kan-Lingwood NY, Sagi L, Mazie S, Shahar N, Zecherle Bitton L, Templeton A, Rubenstein D, Bouskila A and Bar-David S
A major challenge in analysing single-nucleotide polymorphism (SNP) genotype datasets is detecting and filtering errors that bias analyses and lead to misinterpretation of ecological and evolutionary processes. Here, we present a comprehensive method to estimate and minimise genotyping error rates (deviations from the 'true' genotype) in any SNP dataset using triplicates (three repeats of the same sample) in a four-step filtration pipeline. The approach involves: (1) SNP filtering by missing data; (2) SNP filtering by error rates; (3) sample filtering by missing data and (4) detection of recaptured individuals by using estimated SNP error rates. The modular pipeline is provided in an R script that allows customised adjustments. We demonstrate the applicability of the method using non-invasive sampling from the Asiatic wild ass (Equus hemionus) population in Israel. We genotyped 756 samples using 625 SNPs, of which 255 were triplicates of 85 samples. The average SNP error rate, calculated based on the number of mismatching genotypes across triplicates before filtration, was 0.0034 and was reduced to 0.00174 following filtration. Evaluating genetic distance (GD) and relatedness (r) between triplicates before and after filtration (expected to be at the minimum and maximum respectively) showed a significant reduction in the average GD, from 58.1 to 25.3 (p = 0.0002) and a significant increase in relatedness, from r = 0.98 to r = 0.991 (p = 0.00587). We demonstrate how error rate estimation enhances recapture detection and improves genotype quality.
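As an illustration of how a per-SNP error rate can be estimated from triplicates, the sketch below counts, for each SNP, the replicate calls that disagree with the majority genotype of their triplicate and divides by the number of non-missing calls. This is a minimal Python sketch under stated assumptions, not the authors' R pipeline: the majority-vote definition of a mismatch, the function name `snp_error_rates`, the toy genotypes and the `MAX_ERROR` threshold are all hypothetical.

```python
from collections import Counter

MISSING = None      # placeholder for a failed genotype call (assumption)
MAX_ERROR = 0.1     # hypothetical filtering threshold, not from the source


def snp_error_rates(triplicates):
    """Estimate a per-SNP error rate from replicated samples.

    triplicates: list of samples; each sample is a list of replicate
    genotype lists (one genotype string or MISSING per SNP).
    A call counts as a mismatch when it differs from the most common
    genotype among the non-missing replicates of the same sample.
    """
    n_snps = len(triplicates[0][0])
    mismatches = [0] * n_snps
    calls = [0] * n_snps
    for replicates in triplicates:
        for j in range(n_snps):
            observed = [rep[j] for rep in replicates if rep[j] is not MISSING]
            if len(observed) < 2:
                continue  # nothing to compare against
            _, majority_count = Counter(observed).most_common(1)[0]
            mismatches[j] += len(observed) - majority_count
            calls[j] += len(observed)
    return [m / c if c else float("nan") for m, c in zip(mismatches, calls)]


# Toy data: two samples, three replicates each, three SNPs.
toy = [
    [["AA", "AG", "GG"], ["AA", "AG", "GG"], ["AA", "GG", "GG"]],
    [["AG", "AA", MISSING], ["AG", "AA", "CC"], ["AG", "AA", "CC"]],
]
rates = snp_error_rates(toy)

# Step (2) of the pipeline, sketched: keep SNPs at or below the threshold.
kept = [j for j, r in enumerate(rates) if r <= MAX_ERROR]
print(rates, kept)
```

In the toy data only the second SNP has a discordant replicate call, so it is the only SNP removed by the threshold.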
Three Novel Spider Genomes Unveil Spidroin Diversification and Hox Cluster Architecture: Ryuthela nishihirai (Liphistiidae), Uloborus plumipes (Uloboridae) and Cheiracanthium punctorium (Cheiracanthiidae)
Schöneberg Y, Audisio TL, Ben Hamadou A, Forman M, Král J, Kořínková T, Líznarová E, Mayer C, Prokopcová L, Krehenwinkel H, Prost S and Kennedy S
Spiders are a hyperdiverse taxon and among the most abundant predators in nearly all terrestrial habitats. Their success is often attributed to key developments in their evolution such as silk and venom production and major apomorphies such as a whole-genome duplication. Resolving deep relationships within the spider tree of life has been historically challenging, making it difficult to measure the relative importance of these novelties for spider evolution. Whole-genome data offer an essential resource in these efforts, but also for functional genomic studies. Here, we present de novo assemblies for three spider species: Ryuthela nishihirai (Liphistiidae), a representative of the ancient Mesothelae, the suborder that is sister to all other extant spiders; Uloborus plumipes (Uloboridae), a cribellate orbweaver whose phylogenetic placement is especially challenging; and Cheiracanthium punctorium (Cheiracanthiidae), which represents only the second family to be sequenced in the hyperdiverse Dionycha clade. These genomes fill critical gaps in the spider tree of life. Using these novel genomes along with 25 previously published ones, we examine the evolutionary history of spidroin gene diversity and structural Hox cluster diversity. Our assemblies provide critical genomic resources to facilitate deeper investigations into spider evolution. The near chromosome-level genome of the 'living fossil' R. nishihirai represents an especially important step forward, offering new insights into the origins of spider traits.
Barcode 100K Specimens: In a Single Nanopore Run
Hebert PDN, Floyd R, Jafarpour S and Prosser SWJ
It is a global priority to better manage the biosphere, but action must be informed by comprehensive data on the abundance and distribution of species. The acquisition of such information is currently constrained by high costs. DNA barcoding can speed the registration of unknown animal species, the most diverse kingdom of eukaryotes, as the BIN system automates their recognition. However, inexpensive sequencing protocols are critical as the census of all animal species is likely to require the analysis of a billion or more specimens. Barcoding involves DNA extraction followed by PCR and sequencing with the last step dominating costs until 2017. By enabling the sequencing of highly multiplexed samples, the Sequel platforms from Pacific BioSciences slashed costs by 90%, but these instruments are only deployed in core facilities because of their expense. Sequencers from Oxford Nanopore Technologies provide an escape from high capital and service costs, but their low sequence fidelity has, until recently, constrained adoption. However, the improved performance of its latest flow cells (R10.4.1) erases this barrier. This study demonstrates that a MinION flow cell can characterise an amplicon pool derived from 100,000 specimens while a Flongle flow cell can process one derived from several thousand. At $0.01 per specimen, DNA sequencing is now the least expensive step in the barcode workflow.
What can optimized cost distances based on genetic distances offer? A simulation study on the use and misuse of ResistanceGA
Daniel A, Savary P, Foltête JC, Vuidel G, Faivre B, Garnier S and Khimoun A
Modelling population connectivity is central to biodiversity conservation and often relies on resistance surfaces reflecting multi-generational gene flow. ResistanceGA (RGA) is a common optimization framework for parameterizing these surfaces by maximizing the fit between genetic distances and cost distances using maximum likelihood population effect models. As the reliability of this framework has rarely been studied, we investigated the conditions maximizing its accuracy for both prediction and interpretation of landscape features' permeability. We ran demo-genetic simulations in contrasted landscapes for species with distinct dispersal capacities and specialization levels, using corresponding reference cost scenarios. We then optimized resistance surfaces from the simulated genetic distances using RGA. First, we evaluated whether RGA identified the drivers of the genetic patterns, that is, distinguished Isolation-by-Resistance (IBR) patterns from either Isolation-by-Distance or patterns unrelated to ecological distances. We then assessed RGA predictive performance using a cross-validation method, and its ability to recover the reference cost scenarios shaping genetic structure in simulations. IBR patterns were well detected and genetic distances were predicted with great accuracy. This performance depended on the strength of the genetic structuring, sampling design and landscape structure. Matching the scale of the genetic pattern by focusing on population pairs connected through gene flow and limiting overfitting through cross-validation further enhanced inference reliability. Yet, the optimized cost values often departed from the reference values, making their interpretation and extrapolation potentially dubious. While demonstrating the value of RGA for predictive modelling, we call for caution and provide additional guidance for its optimal use.
That's Not a Hybrid: How to Distinguish Patterns of Admixture and Isolation By Distance
Wiens BJ and Colella JP
Describing naturally occurring genetic variation is a fundamental goal of molecular phylogeography and population genetics. Popular methods for this task include STRUCTURE, a model-based algorithm that assigns individuals to genetic clusters, and principal component analysis (PCA), a parameter-free method. The ability of STRUCTURE to infer mixed ancestry makes it popular for documenting natural hybridisation, which is of considerable interest to evolutionary biologists, given that such systems provide a window into the speciation process. Yet, STRUCTURE can produce misleading results when its underlying assumptions are violated, like when genetic variation is distributed continuously across geographic space. To test the ability of STRUCTURE and PCA to accurately distinguish admixture from continuous variation, we use forward-time simulations to generate population genetic data under three demographic scenarios: two involving admixture and one with isolation by distance (IBD). STRUCTURE and PCA alone cannot distinguish admixture from IBD, but complementing these analyses with triangle plots, which visualise hybrid index against interclass heterozygosity, provides more accurate inference of demographic history, especially in cases of recent admixture. We demonstrate that triangle plots are robust to missing data, while STRUCTURE and PCA are not, and show that setting a low allele frequency difference threshold for ancestry-informative marker (AIM) identification can accurately characterise the relationship between hybrid index and interclass heterozygosity across demographic histories of admixture and range expansion. While STRUCTURE and PCA provide useful summaries of genetic variation, results should be paired with triangle plots before admixture is inferred.
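The two axes of a triangle plot can be computed directly from genotypes at ancestry-informative markers. The Python sketch below assumes each AIM is biallelic and that genotypes are coded as the number of alleles (0, 1 or 2) inherited from the second parental population; the function names and example individuals are illustrative only. Hybrid index is the proportion of alleles from that population, and interclass heterozygosity is the proportion of loci carrying one allele from each parental population.

```python
def hybrid_index(genotype):
    """Proportion of alleles derived from parental population 2."""
    return sum(genotype) / (2 * len(genotype))


def interclass_heterozygosity(genotype):
    """Proportion of AIMs carrying one allele from each parental population."""
    return sum(1 for g in genotype if g == 1) / len(genotype)


def inside_triangle(h, het, eps=1e-12):
    """Any diploid genotype satisfies het <= 2 * min(h, 1 - h)."""
    return het <= 2 * min(h, 1 - h) + eps


# Illustrative individuals scored at 10 AIMs (hypothetical data).
individuals = {
    "parental_1": [0] * 10,
    "F1": [1] * 10,
    "backcross": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "old_admixture": [0, 2, 0, 2, 2, 0, 0, 2, 2, 0],
}

points = {
    name: (hybrid_index(g), interclass_heterozygosity(g))
    for name, g in individuals.items()
}
for name, (h, het) in points.items():
    print(f"{name}: hybrid index = {h:.2f}, interclass heterozygosity = {het:.2f}")
```

An F1 sits at the apex of the triangle (hybrid index 0.5, interclass heterozygosity 1.0), whereas an individual with intermediate hybrid index but no interclass heterozygosity lies on the base, which is the kind of contrast the triangle plot makes visible.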
The Chromosome-Scale Genome of Magnolia sieboldii K. Koch Provides Insight Into the Evolutionary Position of Magnoliids and Seed Germination
Lu X, Mei M, Liu L, Xu X and Ai W
Magnolia sieboldii K. Koch (M. sieboldii) stands as an elegant tree species within the Magnoliaceae family, esteemed for its exquisite beauty, cultural significance and economic advantages. The species faces challenges in seed germination under natural conditions, primarily attributed to morphological dormancy. Despite its significance, the molecular mechanisms governing M. sieboldii seed germination remain elusive, compounded by the absence of genomic resources specific to this species. In this study, we present the first chromosome-scale genome assembly of M. sieboldii, with a total genome size of 2.01 Gb, including 1096 scaffolds assigned to 19 chromosomes (N50 = 102.4 Mb). Phylogenetic analyses, incorporating 13 plant species, illuminate the evolutionary independence of Magnoliids from monocots and eudicots, positioning them as a sister clade. Through RNA-seq analysis, we identify pivotal genes and pathways contributing to seed dormancy and germination. In addition, our investigation delves into the far-red-impaired response (FAR1) transcription factor gene family, revealing their enrichment throughout evolution and their involvement in the intricate process of seed germination. This comprehensive genome sequencing initiative offers invaluable insights into the biological attributes of M. sieboldii, with a specific emphasis on unravelling the complexities of seed dormancy and germination.
The Ribosomal Operon Database: A Full-Length rDNA Operon Database Derived From Genome Assemblies
Krabberød AK, Stokke E, Thoen E, Skrede I and Kauserud H
Current rDNA reference sequence databases are tailored towards shorter DNA markers, such as parts of the 16/18S marker or the internally transcribed spacer (ITS) region. However, due to advances in long-read DNA sequencing technologies, longer stretches of the rDNA operon are increasingly used in environmental sequencing studies to increase the phylogenetic resolution. There is, therefore, a growing need for longer rDNA reference sequences. Here, we present the ribosomal operon database (ROD), which includes eukaryotic full-length rDNA operons fished from publicly available genome assemblies. Full-length operons were detected in 34.1% of the 34,701 examined eukaryotic genome assemblies from NCBI. In most cases (53.1%), more than one operon variant was detected, which can be due to intragenomic operon copy variability, allelic variation in non-haploid genomes, or technical errors from the sequencing and assembly process. The highest copy number found was 5947 in Zea mays. In total, 453,697 unique operons were detected, with 69,480 operon variant clusters remaining after intragenomic clustering at 99% sequence identity. The operon length varied extensively across eukaryotes, ranging from 4136 to 16,463 bp, which will lead to considerable polymerase chain reaction (PCR) bias during amplification of the entire operon. Clustering the full-length operons revealed that the different parts (i.e., 18S, 28S, and the hypervariable regions V4 and V9 of 18S) provide divergent taxonomic resolution, with 18S, the V4 and V9 regions being the most conserved. The ROD will be updated regularly to provide an increasing number of full-length rDNA operons to the scientific community.
Correction to "A dedicated target capture approach reveals variable genetic markers across micro-and macro-evolutionary time scales in palms"
A Snakemake Toolkit for the Batch Assembly, Annotation and Phylogenetic Analysis of Mitochondrial Genomes and Ribosomal Genes From Genome Skims of Museum Collections
White OW, Hall A, Price BW, Williams ST and Clark MD
Low coverage 'genome-skims' are often used to assemble organelle genomes and ribosomal gene sequences for cost-effective phylogenetic and barcoding studies. Natural history collections hold invaluable biological information, yet poor preservation resulting in degraded DNA often hinders polymerase chain reaction-based analyses. However, it is possible to generate libraries and sequence the short fragments typical of degraded DNA to generate genome-skims from museum collections. Here we introduce a Snakemake toolkit comprising three pipelines, skim2mito, skim2rrna and gene2phylo, designed to unlock the genomic potential of historical museum specimens using genome skimming. Specifically, skim2mito and skim2rrna perform the batch assembly, annotation and phylogenetic analysis of mitochondrial genomes and nuclear ribosomal genes, respectively, from low-coverage genome skims. The third pipeline, gene2phylo, takes a set of gene alignments and performs phylogenetic analysis of individual genes, partitioned analysis of concatenated alignments and a phylogenetic analysis based on gene trees. We benchmark our pipelines with simulated data, followed by testing with a novel genome skimming dataset from both recent and historical solariellid gastropod samples. We show that the toolkit can recover mitochondrial and ribosomal genes from poorly preserved museum specimens of the gastropod family Solariellidae, and the phylogenetic analysis is consistent with our current understanding of taxonomic relationships. The generation of bioinformatic pipelines that facilitate processing large quantities of sequence data from the vast repository of specimens held in natural history museum collections will greatly aid species discovery and exploration of biodiversity over time, ultimately aiding conservation efforts in the face of a changing planet.
Assembly of Mitochondrial Genomes Using Nanopore Long-Read Technology in Three Sea Chubs (Teleostei: Kyphosidae)
Baeza JA, Minish JJ and Michael TP
Complete mitochondrial genomes have become markers of choice to explore phylogenetic relationships at multiple taxonomic levels and they are often assembled using whole genome short-read sequencing. Herein, using three species of sea chubs as an example, we explored the accuracy of mitochondrial chromosomes assembled using Oxford Nanopore Technology (ONT) Kit 14 R10.4.1 long reads at different sequencing depths (high, low and very low or genome skimming) by comparing them to 'gold' standard reference mitochondrial genomes assembled using Illumina NovaSeq short reads. In two species of sea chubs, Girella nigricans and Kyphosus azureus, ONT long-read assembled mitochondrial genomes at high sequencing depths (> 25× whole [nuclear] genome) were identical to their respective short-read assembled mitochondrial genomes. Not a single 'homopolymer insertion', 'homopolymer deletion', 'simple substitution', 'single insertion', 'short insertion', 'single deletion' or 'short deletion' was detected in the long-read assembled mitochondrial genomes after aligning each one of them to their short-read counterparts. In turn, in a third species, Medialuna californiensis, a 25× sequencing depth long-read assembled mitochondrial genome was 14 nucleotides longer than its short-read counterpart. The difference in total length between the latter two assemblies was due to the presence of a short motif 14 bp long that was repeated (twice) in the long-read but not in the short-read assembly. Read subsampling at a sequencing depth of 1× resulted in the assembly of partial or complete mitochondrial genomes with numerous errors, including, among others, simple indels, and indels at homopolymer regions. At 3× and 5× subsampling, genomes were identical (perfect) or almost identical (quasiperfect, 99.5% over 16,500 bp) to their respective Illumina assemblies.
The newly assembled mitochondrial genomes exhibit identical gene composition and organisation compared with cofamilial species and a phylomitogenomic analysis based on translated protein-coding genes suggested that the family Kyphosidae is not monophyletic. The same analysis detected possible cases of misidentification of mitochondrial genomes deposited in GenBank. This study demonstrates that perfect (complete and fully accurate) or quasiperfect (complete but with a single or a very few errors) mitochondrial genomes can be assembled at high (> 25×) and low (3-5×) but not very low (1×, genome skimming) sequencing depths using ONT long reads and the latest ONT chemistries (Kit 14 and R10.4.1 flowcells with SUP basecalling). The newly assembled and annotated mitochondrial genomes can be used as a reference in environmental DNA studies focusing on bioprospecting and biomonitoring of these and other coastal species experiencing environmental insult. Given the small size of the sequencing device and low cost, we argue that ONT technology has the potential to improve access to high-throughput sequencing technologies in low- and moderate-income countries.
Correcting for Replicated Genotypes May Introduce More Problems Than it Solves
Meirmans PG
Across the tree of life, many organisms are able to reproduce clonally, via vegetative spread, budding or parthenogenesis. In population genetic analyses of clonally reproducing organisms, it is common practice to retain only a single representative per multilocus genotype. Though this practice of clone correction is widespread, the theoretical justification behind it has been very little studied. Here, I use individual-based simulations to study the effect of clone correction on the estimation of the genetic summary statistics H_S, H_T, F_IS, F_ST, F''_ST and D_est. The simulations follow the standard finite island model, consisting of a set of populations connected by gene flow, but with a variable rate of sexual versus asexual reproduction. The results of the simulations show that by itself, the inclusion of replicated genotypes does not lead to a deviation in the values of the summary statistics, except when the rate of sexual reproduction is less than about one in a thousand. However, clone correction can introduce a strong deviation in the values of most of the statistics, when compared to a scenario of full sexual reproduction. For H_S and F_IS, this deviation can be informative about the process of asexual reproduction, but for F_ST, F''_ST and D_est, clone correction can lead to incorrect conclusions. I therefore argue that clone correction is not strictly necessary, but can in some cases be insightful. However, when clone correction is applied, it is imperative that results for both the corrected and uncorrected data are presented.
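To make the practice concrete, the Python sketch below applies clone correction to a toy population and shows how a summary statistic can shift as a result. It is only an illustration under simple assumptions (one population, one biallelic locus, expected heterozygosity without small-sample correction, and the inbreeding coefficient, commonly written F_IS, computed as 1 - Ho/He); it is not the simulation framework used in the study, and the function names and data are hypothetical.

```python
from collections import Counter


def clone_correct(genotypes):
    """Keep a single representative per distinct multilocus genotype."""
    seen = set()
    kept = []
    for g in genotypes:
        if g not in seen:
            seen.add(g)
            kept.append(g)
    return kept


def f_is(genotypes, locus=0):
    """Inbreeding coefficient 1 - Ho/He at one locus (no sample-size correction)."""
    n = len(genotypes)
    ho = sum(1 for g in genotypes if g[locus][0] != g[locus][1]) / n
    allele_counts = Counter(a for g in genotypes for a in g[locus])
    he = 1.0 - sum((c / (2 * n)) ** 2 for c in allele_counts.values())
    if he == 0:
        return float("nan")  # monomorphic locus: statistic undefined
    return 1.0 - ho / he


# Toy population: a heterozygous clone sampled six times plus two
# homozygous individuals. Each individual is a tuple of loci, and each
# locus is a tuple of two alleles.
population = [(("A", "B"),)] * 6 + [(("A", "A"),), (("B", "B"),)]
corrected = clone_correct(population)

uncorrected_fis = f_is(population)   # heterozygote excess from the clone
corrected_fis = f_is(corrected)      # one representative per genotype
print(len(corrected), uncorrected_fis, corrected_fis)
```

In this toy case the statistic changes sign after correction, which is why presenting results for both the corrected and uncorrected data, as the abstract recommends, is informative.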
EVE-X: Software to Identify Novel Viral Insertions in Wild-Caught Arthropod Hosts From Next-Generation Short Read Data
Havill J, Strasburg O, Udoh T, Crawford JE and Gloria-Soria A
Eukaryotic genomes harbour sequences derived from non-retroviral RNA viruses, known as endogenous viral elements (EVEs) or non-retroviral integrated RNA virus sequences (NIRVS). These sequences represent a record of past infections and have been implicated in host anti-viral response. We have created a program to identify viral sequences integrated in a host genome. It begins with a specimen BAM file and outputs candidate NIRVS, along with putative host insertion sites and overlapping genomic features of the host genome in XML and visual formats, with minimal intermediary intervention. We ran short-read data derived from the genomes of 222 wild-caught Aedes aegypti mosquitoes, from a dozen geographical regions, through this software and located putative NIRVS from seven virus families. This program is as accurate as currently available software for NIRVS detection, and represents a significant improvement in adaptability and user-friendliness. Furthermore, the flexibility of this pipeline allows the user to search for sequence integrations across the genome of any organism, as long as a query sequence database and a reference genome are provided. Potential extended applications include identification of integrated transgenic sequences used for research or vector control strategies.
Detecting Assembly Errors With Klumpy: Building Confidence in Your Daily Genomic Analysis
Tsai IJ
In the realm of genome assembly, even minor errors can send researchers down rabbit holes of unintended misinterpretation. Enter Klumpy, a tool designed to help detect these elusive mistakes before they cause significant problems. By providing detailed, region-specific assessments and an intuitive visualisation platform, Klumpy (Madrigal et al. 2024) empowers researchers to pinpoint and resolve potential issues with precision, paving the way for more reliable downstream analyses and discoveries.