Data processing for site visitors
FAQ: How data is processed - for site visitors
What data does eDNA Explorer process?
eDNA Explorer users upload their raw, unprocessed DNA sequence data and metadata, which we analyze with Tronko, our data processing tool (described further below). Our aim with eDNA Explorer is to present data from different eDNA projects in a standardized format. Processing raw sequence data with Tronko allows each project's results to be compared with the other findings shared on our site.
How does Tronko process data?
Tronko is an innovative tool for quickly creating an organism list from the sequences in a raw eDNA dataset. It was created by Dr. Lenore Pipes and Dr. Rasmus Nielsen of UC Berkeley. Tronko is new, and a manuscript describing the method is currently under review, but a preprint is available, titled “A rapid phylogeny-based method for accurate community profiling of large-scale metabarcoding datasets”. The main advantage of Tronko is that it is phylogeny-based and can place an eDNA sequence anywhere on a gene tree, including on a new leaf or at an internal node. Tronko outputs a taxonomy assignment for each unique read in a dataset and records the mismatches between that read and its closest reference sequence. We display Tronko output tables as standard taxonomy tables that you can use in standard analyses. When you download this data, you can explore the taxonomy tables made with a range of mismatch filters.
How does Tronko work in more detail?
Tronko starts with a set of reference databases that contain genetic sequences with known taxonomies. It organizes these sequences into dozens to thousands of phylogenetic trees, grouping them by how closely related they are. The method for building these trees, also known as ‘ancestral clusters’, was published in Bioinformatics if you want to learn more. Because the reference sequences are clustered, Tronko can analyze eDNA sequences quickly: it only has to work out where an eDNA sequence falls within the best-matching tree. An eDNA sequence can be placed wherever on the tree it fits best, even at an internal node or as a novel taxon branch.
Tronko assigns each unique read (which we call an amplicon sequence variant or ASV) separately. For paired reads, Tronko jointly assigns the forward and reverse read to taxa. If the results for the pair do not match, it assumes the ASV is a chimera (which is surprisingly common) and discards the ASV. This usually removes 0-5% of our ASVs. All ASVs are described with summary statistics in the ASV download, and some of these statistics can be used as a filter.
Tronko outputs several statistics, but it is important to keep in mind first that it heavily penalizes mutations between an eDNA sequence and a reference sequence, so some ASVs may only be resolved to a classification level higher than species. We use a Lowest Common Ancestor approach: if an eDNA ASV has multiple nearly equally good reference matches, we assign the sequence to the taxon on the reference tree that is the common ancestor of those matches. For example, if a sequence matches equally well with dog (Canis lupus familiaris) and coyote (Canis latrans), Tronko will simply call the ASV Canis. This setting in Tronko Assign is called -c and is set to 10.
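The Lowest Common Ancestor idea can be illustrated with a small Python sketch. This is not Tronko's implementation: Tronko works on its reference trees and uses the -c cutoff to decide which matches count as nearly equal, whereas the sketch below simply assumes we already have a list of tied matches, each written as a root-to-tip lineage. The function name and the lineages are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the part of the lineage shared by every tied reference match.

    Each lineage is ordered from the broadest rank to the most specific one.
    (Illustrative helper, not part of Tronko.)
    """
    if not lineages:
        return []
    shared: List[str] = []
    # Walk down the ranks in parallel and stop at the first disagreement.
    for names_at_rank in zip(*lineages):
        if all(name == names_at_rank[0] for name in names_at_rank):
            shared.append(names_at_rank[0])
        else:
            break
    return shared


# Hypothetical tied matches for one ASV.
dog = ["Animalia", "Chordata", "Mammalia", "Carnivora", "Canidae",
       "Canis", "Canis lupus familiaris"]
coyote = ["Animalia", "Chordata", "Mammalia", "Carnivora", "Canidae",
          "Canis", "Canis latrans"]

assignment = lowest_common_ancestor([dog, coyote])
print(assignment[-1])  # Canis
```

When there is only one best match, the whole lineage is shared and the ASV keeps its species-level name.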
Do you recommend certain filter settings with Tronko results in the Organism List and in other charts and graphs?
We recommend exploring the confidence slider settings in the Filters pop-up.
Confidence setting
The confidence slider in Filters controls what organisms are displayed based on how well the DNA sequences found for them in the samples (referred to as ASVs or reads) match against those in the reference database. It works by filtering sequences based on their "divergence" from a known reference DNA sequence.
What is Divergence?
Divergence is a measure of how different an eDNA ASV (or read) is from a reference DNA sequence. It's calculated as a percentage:
Divergence (%) = (number of mismatches / sequence length in base pairs) × 100
How the Confidence Slider Works:
Each setting on the confidence slider (1-5) applies a specific filter, balancing the assignment of eDNA reads with the accuracy of those assignments:
- 1 - Least Confidence:
  - Filter: Divergence ≤ 20%
  - Impact: This setting is the least stringent. It will assign the most eDNA reads to a reference, but with a higher tolerance for genetic differences, potentially leading to more false positives.
- 2 - Low Confidence:
  - Filter: Divergence ≤ 15%
- 3 - Default Confidence:
  - Filter: Divergence ≤ 10%
  - Impact: This is the default setting, offering a good balance between sensitivity and specificity.
- 4 - High Confidence (for longer sequences):
  - Filter: Divergence ≤ 10% AND a maximum of 25 DNA mismatches.
  - Impact: This setting is the same as the default (setting 3) unless the eDNA reads are longer than 250 base pairs. For very long sequences, it adds an extra layer of stringency by limiting the absolute number of allowed mismatches, even if the percentage divergence is low.
- 5 - Highest Confidence:
  - Filter: Divergence ≤ 5% AND a maximum of 10 DNA mismatches.
  - Impact: This is the most stringent setting. While it provides the highest confidence in the assigned eDNA sequences (meaning they are a very close match to the reference), it may "undercount" by rejecting many reads that are slightly more divergent, potentially missing valid detections.
The mismatch count is the number of mutations or gaps between an eDNA sequence and the reference database sequence. If a CO1 amplicon insert has 333 bp and you allow 10 mismatches, that’s roughly 3% divergence. Until recently, people were taking sequences with up to 5% divergence, making a consensus “OTU” and using that for counting species in eDNA. Consequently, we recommend that you think about which settings to use when viewing data, depending on whether you want near-perfect matches or are counting taxa in general. You should also consider how much error a polymerase introduces.
Because there is no one-size-fits-all mismatch choice for a project and primer set, we let you play with this filter and choose the level of mismatch tolerance you want. Remember, a higher mismatch tolerance will let you estimate more taxa, which may be real but not correctly assigned because, well, we haven’t sequenced all species on Earth yet! More taxa may help you better estimate alpha and beta diversity. Alternatively, allowing fewer mismatches will reduce the number of taxa returned but will help you create more accurate species lists. In summary, the mismatch setting depends on your goal: are you comparing biodiversity or making species lists?
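The five slider settings can be written down as a small lookup table. The Python sketch below is only an illustration of the thresholds listed above, not the site's actual code. It assumes that divergence is the number of mismatches divided by the sequence length, and the names `ConfidenceRule`, `RULES`, `divergence_pct` and `passes` are made up for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConfidenceRule:
    max_divergence_pct: float              # upper limit on percent divergence
    max_mismatches: Optional[int] = None   # optional cap on absolute mismatches


# Thresholds copied from the slider description above.
RULES = {
    1: ConfidenceRule(20.0),
    2: ConfidenceRule(15.0),
    3: ConfidenceRule(10.0),
    4: ConfidenceRule(10.0, 25),
    5: ConfidenceRule(5.0, 10),
}


def divergence_pct(mismatches: int, length_bp: int) -> float:
    """Percent divergence, assuming mismatches / sequence length."""
    return 100.0 * mismatches / length_bp


def passes(level: int, mismatches: int, length_bp: int) -> bool:
    """Return True if an ASV would be kept at the given slider level."""
    rule = RULES[level]
    if divergence_pct(mismatches, length_bp) > rule.max_divergence_pct:
        return False
    if rule.max_mismatches is not None and mismatches > rule.max_mismatches:
        return False
    return True


# A hypothetical 300 bp ASV with 28 mismatches (about 9.3% divergence)
# is kept at level 3 but dropped at level 4 because of the 25-mismatch cap.
print(passes(3, 28, 300), passes(4, 28, 300))  # True False
```

The example shows why setting 4 only differs from setting 3 for longer reads: below 250 bp, the 10% divergence limit is always reached before the 25-mismatch cap.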
You may also want to experiment with the level of classification in the Filters pop-up. Many erroneous taxa are removed when you only look at genus-level or family-level assignments. When you select the genus-level view, we sum together all the species that share a genus.
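As a rough illustration of the genus-level view, the Python sketch below sums per-species values into per-genus totals. It assumes that the genus is the first word of each species name and that the values being summed are read counts; a real taxonomy table would normally carry a separate genus column. The function name and the numbers are hypothetical.

```python
from collections import Counter
from typing import Dict


def genus_level_counts(species_counts: Dict[str, int]) -> Dict[str, int]:
    """Collapse species-level counts into genus-level totals."""
    totals: Counter = Counter()
    for species, reads in species_counts.items():
        genus = species.split()[0]  # assumption: "Genus species" naming
        totals[genus] += reads
    return dict(totals)


# Hypothetical read counts for three species.
counts = {"Canis latrans": 120, "Canis lupus": 30, "Vulpes vulpes": 45}
print(genus_level_counts(counts))  # {'Canis': 150, 'Vulpes': 45}
```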
What is an organism group, primer, reference library and mismatch?
Organism Group / Primer
Organism Groups are categories of organisms that are targeted for identification in a study. Each Group corresponds to a Primer, a special short DNA sequence used in eDNA analysis. Primers are designed to focus attention on a specific region of an eDNA strand. A pair of Primers matches up with each side of the region of DNA that we want to look at more closely and marks it as the target area of interest, like little scissors that cut out the snippet of the strand that we want. The DNA code in the middle between two primers (the snippet we cut out) is called a barcode and is, in most cases, unique for each species. Once the primers help us find the barcodes we are interested in, we use a technique called PCR to make a lot of copies of the barcodes so we can see them more easily. Primers can be designed to look for barcodes that represent one species only or a group of species. For example, ITS1 primers are designed to help us identify organisms in the Fungi group, so they stick to the start and end of barcodes that most Fungi have. By viewing the results of the different primers used when sequencing, we can discover many different organism groups from across the tree of life in one sample.
Mismatch
A mismatch is a part of an organism’s DNA sequence that doesn’t match exactly what is found in the sequence for that organism in reference libraries. These mismatches are most likely mutations that accumulate as organisms evolve and diverge, but sometimes a mismatch is caused by a laboratory process. By default, the Organism List is set to confidence level 3 (out of 5), which lets up to 10% divergence slide before the organism is not shown in the table. The more mismatches you allow, the lower the confidence you should have that the organism match is the right one. It’s like saying, it might be a duck if it kind of looks like a duck. In contrast, if you want the DNA to match as closely as possible, raise the Confidence Level to 5 in the filters, which allows at most 5% divergence and 10 mismatches. This is like saying if it doesn’t look almost exactly like a duck and quack almost exactly like a duck, it’s not a duck! Scientists decide how much of the sequence needs to match before they feel confident by drawing on other information about the sample and the location where it was collected. One advantage of allowing more mismatches is that, because we haven’t sequenced ALL species yet, you have a better chance of counting the number of species if you allow a DNA sequence to be matched to a relative of what it actually is, rather than throwing it away entirely.
Reference library
A reference library is a collection of DNA information organized by unique organisms. Most organisms in the reference library have a unique DNA sequence associated with them, but there are always exceptions, such as two species that are so closely related they don’t have any DNA sequence differences (also known as mutations or mismatches). eDNA Explorer identifies which plants, animals, bacteria or other living things are in the data collected by comparing the eDNA in the samples to the known DNA in the reference library and seeing what matches up. Our reference libraries are made by taking all the sequences available in NCBI – the United States’ public DNA data repository – and trimming them down to what could be picked up in PCR by a chosen set of primers. Each primer set has its own reference library. We recommend looking through the reference libraries linked at the top of the Project page to see if your species of interest are there. The reference libraries used by Tronko are stored as ancestral clusters (phylogenies), but the fasta and taxonomy files are available for exploration and open use here: https://zenodo.org/records/15353120.
What do the GBIF comparison charts show?
The charts show the overlap between organisms found by eDNA and organisms recorded on the Global Biodiversity Information Facility (GBIF) website. The GBIF list is a great go-to list to help validate an eDNA observation or to identify where there may be gaps in reference libraries. For example, eDNA data from an African park might show Grevy’s zebra while GBIF shows Grant’s zebra. In that case, Grant’s zebra may be missing from the reference library.
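The overlap behind these charts comes down to comparing two lists of taxa. The Python sketch below shows one simple way to do that with sets. It is an illustration only, not the code behind the charts, and the function name and taxon lists are hypothetical.

```python
from typing import Dict, Iterable, List


def compare_lists(edna_taxa: Iterable[str],
                  gbif_taxa: Iterable[str]) -> Dict[str, List[str]]:
    """Split taxa into those seen by both sources and by only one."""
    edna, gbif = set(edna_taxa), set(gbif_taxa)
    return {
        "both": sorted(edna & gbif),
        "edna_only": sorted(edna - gbif),
        "gbif_only": sorted(gbif - edna),
    }


# Hypothetical lists for one site.
edna = ["Equus grevyi", "Loxodonta africana"]         # Grevy's zebra, elephant
gbif = ["Equus quagga boehmi", "Loxodonta africana"]  # Grant's zebra, elephant

result = compare_lists(edna, gbif)
print(result["both"])       # ['Loxodonta africana']
print(result["edna_only"])  # ['Equus grevyi']
print(result["gbif_only"])  # ['Equus quagga boehmi']
```

Taxa that appear only in the GBIF list are good candidates to look up in the reference library, since a missing reference sequence is one possible reason eDNA did not report them.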