Data processing for project owners

FAQ: How data is processed - for project owners

Why do you need sequence data? I already have results to share. Our aim with eDNA Explorer is to present data from different eDNA projects in a standardized format. For this to work, we ask users to upload their raw, unprocessed data to be analyzed with Tronko. This allows their results to be compared with the other findings shared on our site. If you have other results or publications you want to share through eDNA Explorer, you can add a DOI link to your project page.

Can I add more samples to a project I’ve uploaded? Samples can be added to a project at any time. Keep in mind that when you add new samples, you will have to update your metadata with the new sample information. If you are adding samples to a project that is already processed or published, click the “Add and manage data” button. From there, you can upload the new FASTQ files and an updated metadata spreadsheet.

How does Tronko process my data? Tronko is a tool for quickly creating an organism list from the sequences in a raw eDNA dataset. It was created by Dr. Lenore Pipes and Dr. Rasmus Nielsen of UC Berkeley. Tronko is a new tool, and a manuscript describing the method has been published in eLife. Its main advantage is that it is phylogenetically based and can place your eDNA sequences anywhere on a gene tree, including on a new leaf or at an internal node. Tronko outputs a taxonomy assignment for each unique read in your dataset and records the mismatches between your sequence and the closest reference sequence. We provide your Tronko output as standard taxonomy tables that you can use in standard community ecology analyses. When you download them, you can explore taxonomy tables made with a range of mismatch filters.

How does Tronko work? Tronko starts with a set of reference databases that contain genetic sequences with known taxonomies. It organizes these sequences into dozens to thousands of phylogenetic trees, grouping them by how closely related they are. The method for building these trees, also known as ‘ancestral clusters’, was published in Bioinformatics if you want to learn more. Because the reference sequences are clustered, Tronko can analyze eDNA sequences quickly: it only has to work out where your eDNA sequence falls within the best-matching tree. Your eDNA sequence can be placed wherever it fits best on that tree, including at an inner node or as a novel taxon branch.

Tronko assigns each unique read (which we call an amplicon sequence variant, or ASV) separately. For paired reads, Tronko assigns the forward read and the reverse read independently. If the two results do not match, it assumes the ASV is a chimera (which is surprisingly common) and discards it. This usually removes around 5% of our ASVs. All remaining ASVs are scored with summary statistics that you can use as filters.
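To make the paired-read check concrete, here is a minimal Python sketch of the idea described above. It is not Tronko's own code: the `PairedAssignment` record, the `split_concordant` function, and the rule that "matching" means identical taxon strings are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Tuple


class PairedAssignment(NamedTuple):
    """Hypothetical record: one ASV with its two independent assignments."""
    asv_id: str
    forward_taxon: str
    reverse_taxon: str


def split_concordant(
    assignments: List[PairedAssignment],
) -> Tuple[List[PairedAssignment], List[PairedAssignment]]:
    """Separate ASVs whose forward and reverse assignments agree from those that do not.

    Assumption: agreement means the two taxon strings are identical.
    """
    kept: List[PairedAssignment] = []
    discarded: List[PairedAssignment] = []
    for assignment in assignments:
        if assignment.forward_taxon == assignment.reverse_taxon:
            kept.append(assignment)
        else:
            # Disagreement is treated as a likely chimera and dropped.
            discarded.append(assignment)
    return kept, discarded


# Illustrative data only; these are not real results.
example = [
    PairedAssignment("asv1", "Salmo trutta", "Salmo trutta"),
    PairedAssignment("asv2", "Salmo trutta", "Canis lupus"),
]
kept, discarded = split_concordant(example)
print([a.asv_id for a in kept])       # ['asv1']
print([a.asv_id for a in discarded])  # ['asv2']
```

Returning the discarded ASVs alongside the kept ones makes it easy to check what fraction of a dataset was removed.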

Tronko outputs several statistics, but it’s important to first keep in mind that it heavily penalizes mutations between your eDNA sequence and a reference sequence. As a result, some ASVs may only be resolved to a classification level higher than species. We are working on a lowest common ancestor approach, which we expect to release in a few years.

Do you recommend certain filter settings with Tronko results?

We recommend exploring the mismatch setting for filtering. This is the number of mutations or gaps between your eDNA sequence and the reference database sequence. If your CO1 amplicon insert is 333 bp long and you allow 10 mismatches, that’s roughly 3% mismatch. Until recently, people were taking sequences with up to 3% divergence, building a consensus “OTU”, and using that to count species in eDNA. So think about which setting suits you, depending on whether you want perfect matches or are counting taxa. You should also consider how much error your polymerase introduces. Because there is no one-size-fits-all mismatch choice for a project and primer set, we let you play with this filter and choose the level of mismatch tolerance you want. Remember, a higher mismatch tolerance will return more taxa, some of which may be real but not correctly assigned because, well, we haven’t sequenced all species on Earth yet! More taxa may help you better estimate alpha and beta diversity. Alternatively, allowing fewer mismatches will reduce the number of taxa returned but will help you create more accurate species lists. In summary, the mismatch setting depends on your goal: are you comparing biodiversity or making species lists?
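The arithmetic and the filter described above can be sketched in a few lines of Python. This is only an illustration: the `AsvRow` record, its field names, and the example rows are hypothetical and do not reflect the actual layout of the downloadable tables.

```python
from typing import List, NamedTuple


class AsvRow(NamedTuple):
    """Hypothetical table row: one ASV, its assigned taxon, and its mismatch count."""
    asv_id: str
    taxon: str
    mismatches: int


def mismatch_percent(mismatches: int, amplicon_length_bp: int) -> float:
    """Express a mismatch count as a percentage of the amplicon length."""
    if amplicon_length_bp <= 0:
        raise ValueError("amplicon_length_bp must be positive")
    return 100.0 * mismatches / amplicon_length_bp


def filter_by_mismatches(rows: List[AsvRow], max_mismatches: int) -> List[AsvRow]:
    """Keep only ASVs with at most max_mismatches mismatches to the reference."""
    return [row for row in rows if row.mismatches <= max_mismatches]


# The CO1 example from the text: 10 mismatches over a 333 bp insert.
print(round(mismatch_percent(10, 333), 1))  # 3.0

# Illustrative rows only; these are not real results.
rows = [
    AsvRow("asv1", "Salmo trutta", 0),
    AsvRow("asv2", "Salmo", 7),
    AsvRow("asv3", "Salmonidae", 15),
]
for limit in (0, 10, 20):
    # A stricter limit returns fewer ASVs; a looser limit returns more.
    print(limit, len(filter_by_mismatches(rows, limit)))
```

Running the filter at several limits, as in the loop, is a quick way to see how sensitive your taxon counts are to the mismatch tolerance before settling on one.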

You may also want to consider playing with the level of classification. Many erroneous taxa are removed when you only look at genus- or family-level assignments. When you select the genus-level view, we sum together all the species that share a genus.
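As a rough picture of what the genus-level view does, the following Python sketch sums read counts across species that share a genus. The `SpeciesCount` record, its fields, and the numbers are hypothetical, and the site's own aggregation may differ in detail.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class SpeciesCount(NamedTuple):
    """Hypothetical species-level row: genus, species, and a read count."""
    genus: str
    species: str
    reads: int


def collapse_to_genus(rows: List[SpeciesCount]) -> Dict[str, int]:
    """Sum read counts over all species that share a genus."""
    totals: Counter = Counter()
    for row in rows:
        totals[row.genus] += row.reads
    return dict(totals)


# Illustrative counts only; these are not real results.
species_table = [
    SpeciesCount("Salmo", "Salmo trutta", 120),
    SpeciesCount("Salmo", "Salmo salar", 30),
    SpeciesCount("Canis", "Canis lupus", 8),
]
genus_table = collapse_to_genus(species_table)
print(genus_table)  # {'Salmo': 150, 'Canis': 8}
```

Collapsing this way means a read assigned to the wrong species within the right genus still counts toward the correct genus total.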
