Q(uestion): Can I record this webinar?
A(nswer): The webinar is being recorded on our end and will be made available on this playlist in our YouTube channel (we will send the direct link to this webinar). You will not be able to record on your end.
Q: Is it possible to do a statistical analysis by comparing the abundance of 16S or 18S genes in a metagenomic dataset?
A: No. We typically discourage using 16S or 18S gene information from metagenomes. Because these genes are highly conserved between different microbes, they are poorly recovered from metagenomes and so are not ideal markers to use.
Instead, comparison of taxonomic composition can be done using the taxonomic assignment of protein-coding genes on contigs in the Stats tool (see features under “By taxonomy”).
Q: Is there a possibility to make such comparisons using metatranscriptomes?
A: It is possible to use the Stats Tool with metatranscriptomes, although we don’t have accurate coverage (read depth) estimates for all of them. IMG currently treats metatranscriptomes mostly like metagenomes, with a few exceptions.
Q: In the case of metagenomes, were they assembled or not?
A: The vast majority of metagenomes on IMG are assembled. The example presented in the webinar is also assembled and includes read depth (i.e. coverage) information. However, there are a few legacy metagenomes with unassembled data as well. We recommend NOT using the Stats Tool with these unassembled legacy metagenomes.
Q: Are there also phylogenetic aware methods available?
A: If by “phylogenetic aware” you mean methods based on an actual phylogenetic tree (e.g. UniFrac), then no, not currently. We do have a tool in development that will use general linear models with taxonomic (phylogenetic) composition as a fixed variable, but that will be a later release.
Q: Can we compare our 16S amplicon data with the 16S genes derived from metagenomes in the database, using this tool and based on taxonomic information?
A: This is doable using Find Genes > BLAST > RNA and choosing the 16S rRNA assembled metagenomes database from the pull-down menu. See: https://img.jgi.doe.gov/cgi-bin/m/main.cgi?section=WorkspaceBlast&page=16form
Q: Can this tool be used to compare two or more groups of genomes to infer the extent to which functions are maintained or lost among genomes? Example: I have endosymbiont genomes with reduced genome size, and I want to discover functions that are lost or significantly reduced compared to a set of genomes from free-living relatives.
A: Yes, it can be used to compare groups of genomes. Endosymbionts vs free-living relatives is a great use case for this tool.
Q: Can I use IMG for functional analysis of MinION-generated data?
A: Yes, IMG includes genomes and metagenomes sequenced with MinION sequencers. However, if the assemblies are not properly polished (e.g. using short reads or very high coverage), the resulting assemblies and protein predictions may be suboptimal, which will affect the quality of the comparison.
Q: Does the ANOVA test assume that you have previously tested the homogeneity of variances or normality?
A: Yes, it does. The pipeline does not check for normality of the distribution or homogeneity of variances; both are assumed. See details for each test in the user guide under “Default statistical method.”
Q: Can analysis by taxonomy be done on metatranscriptomes?
A: In principle, yes, comparison of taxonomy can be used for analysis of metatranscriptomes, but with the caveat that metatranscriptomes are often amplified during library prep, meaning that the abundances will be biased.
Q: Can estimated gene copies be used on metatranscriptomes?
A: Yes, estimated gene copies are available for most recently added metatranscriptomes, but not all.
Q: For taxonomic assignment of a scaffold, what is the rationale of at least 50% of the genes being required to have a hit?
A: We did benchmark the taxonomic assignment of scaffolds based on the majority rule (>=50% of genes with the hits to the same lineage). It produces very few false positives (defined as an assignment to a wrong lineage), but quite a few false negatives (defined as an assignment to a correct higher-level lineage, such as phylum or class, when in principle an assignment to a lower-level lineage, such as order or family, was possible). We prefer to err on the side of caution with this rather conservative threshold for assignment.
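To make the majority rule concrete, here is a minimal sketch of how such an assignment could work. It is an illustration only, not IMG's actual implementation: the function name `assign_scaffold_lineage`, the rank list, the choice of "all genes on the scaffold" as the denominator, and the example lineages are all assumptions.

```python
from collections import Counter

# Hypothetical rank order used for this illustration.
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]

def assign_scaffold_lineage(gene_lineages, min_fraction=0.5):
    """Return the deepest lineage shared by at least min_fraction of genes.

    gene_lineages holds one entry per gene on the scaffold: a tuple of
    taxon names ordered like RANKS, or None when the gene has no hit.
    Assumption: the fraction is computed over all genes on the scaffold.
    """
    total_genes = len(gene_lineages)
    assigned = []
    if total_genes == 0:
        return assigned
    for depth in range(1, len(RANKS) + 1):
        # Count lineage prefixes of this depth among genes whose hit reaches it.
        prefixes = Counter(
            lineage[:depth]
            for lineage in gene_lineages
            if lineage is not None and len(lineage) >= depth
        )
        if not prefixes:
            break
        best_prefix, votes = prefixes.most_common(1)[0]
        if votes / total_genes < min_fraction:
            break  # no majority at this rank: stop at the previous one
        assigned = list(best_prefix)
    return assigned

genes = [
    ("Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales"),
    ("Bacteria", "Proteobacteria", "Gammaproteobacteria", "Pseudomonadales"),
    ("Bacteria", "Proteobacteria", "Alphaproteobacteria"),
    None,  # gene without any hit
]
print(assign_scaffold_lineage(genes))
```

In this made-up example the scaffold is assigned down to class (Gammaproteobacteria) and no further, because no single order reaches 50% of the genes. That is the conservative behavior described above: the assignment stops at a higher-level lineage rather than risk a wrong lower-level one.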
Q: Is the taxonomic affiliation similar to GTDB?
A: No, the taxonomy in IMG is NCBI, not GTDB.
Q: Is GTDB taxonomy available for genomes?
A: GTDB taxonomy is available only for metagenome bins, not isolate genomes (see Metagenome Bins webinar for more information).
Q: Does “carefully curate” (in the presentation) mean high quality genomes and scaffolds in each set?
A: Curate means multiple things: high-quality genomes, and metagenomes with similar levels of coverage (avoid comparing metagenomes with drastically different total assembled length and/or gene count). But it also means curating the “metadata”: e.g. if you want to compare freshwater vs marine genomes for a specific taxon, curating would mean making sure that the genomes in the “freshwater” group are really from freshwater environments, etc. Descriptions can be inadequate or misleading in some cases, so it’s always better to either refer to a publication for the samples or contact the P.I. for additional details that may not be captured.
Q: For a publication, can I do a statistical comparison like the demo example with two groups of different sizes, e.g. 30 and 14 samples?
A: Yes, you can use groups with different sizes (although ideally within the same order of magnitude).
Q: What should the minimum number of samples be in each set for good statistics?
A: There is no real “minimum” number of samples; you could technically do a comparison between 2 groups of 2 (meta)genomes each. The important part is to not over-interpret the results: if your groups are small (e.g. fewer than 10, or even fewer than 5), then any results should be carefully presented as based on a very small sample size.
Another consideration: the “minimum” number of samples depends on how different the samples you’re comparing are (in the sense of statistical power; please see https://en.wikipedia.org/wiki/Power_of_a_test for a discussion of the relationship between sample sizes and magnitude of the effects). Generally speaking, if you compare human gut to marine samples, you’re likely to find significant differences even with 2-3 samples per group. On the other hand, if you’re comparing soil samples from different plots, even 50 samples per group may not be enough to find statistically significant differences. In such cases you may want to download the “Full Results” file generated by the tool and try more powerful tests better suited to your particular example.
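One way to see why very small groups call for caution: for an exact rank-based test such as Mann-Whitney (without ties), the smallest possible two-sided p-value depends only on the group sizes. The sketch below computes that floor. It illustrates a general property of rank tests and is not a description of how the Stats tool computes its p-values; parametric tests do not have this floor. The function name is hypothetical.

```python
from math import comb

def min_two_sided_exact_p(n1, n2):
    """Smallest two-sided p-value of an exact rank-sum test without ties.

    With complete separation of the two groups, only 2 of the
    comb(n1 + n2, n1) equally likely rank arrangements are as extreme.
    """
    return min(1.0, 2 / comb(n1 + n2, n1))

for n in (2, 3, 5, 10):
    print(n, min_two_sided_exact_p(n, n))
```

For 2 vs 2 samples the floor is 2/6 (about 0.33), and for 3 vs 3 it is 2/20 = 0.1, so such a test cannot reach p < 0.05 with groups that small, however different the samples are.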
Q: Can a genome be eliminated from a genome set?
A: Yes, you can edit a genome set. View the list of genomes in the set. Select the genomes you want to keep (i.e. all the genomes except the one you would like to eliminate). Click the “save” tab. In “save to workspace” choose the “replace” option.
Q: If we have submitted our metagenome sequences in NCBI database, is it possible to pull the study into IMG?
A: IMG does not automatically import metagenomes from NCBI. However, you can submit your (assembled) metagenomes to IMG to get them annotated and have them available for the type of statistical comparison we are showing today. Please view our webinar on the topic of data submission.
Q: Using the Mann-Whitney test, I get a p-value of 0.0005 for many features when comparing two metagenome groups, but the false discovery rate (FDR) adjusted p-value increases to 0.5. What does that mean? Which is more important: the p-value or the adjusted p-value?
A: “Adjusted p-value” is the one you should be looking at (it is adjusted for the number of comparisons performed). Here’s something to read about false discovery rate and why it’s important for genome and metagenome analysis: https://en.wikipedia.org/wiki/False_discovery_rate
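To illustrate how a small raw p-value can turn into a large adjusted one, here is a minimal sketch of the Benjamini-Hochberg procedure, one common way of computing FDR-adjusted p-values. The webinar does not state which FDR method the Stats tool uses, so the choice of Benjamini-Hochberg, the function name, and the made-up p-values are all assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical run: 2000 features tested, only two with p = 0.0005.
p_values = [0.0005, 0.0005] + [0.8] * 1998
adjusted = benjamini_hochberg(p_values)
print(round(adjusted[0], 3))
```

In this invented case the two features with p = 0.0005 end up with an adjusted p-value of about 0.5: when thousands of features are tested, a handful of p-values that small are expected by chance alone, which is what the adjustment accounts for.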
Q: Does “mean” here refer to the mean number of scaffolds?
A: “Mean” is the mean count of the feature being compared. For instance, if the feature is one of the taxonomic levels (like class or phylum), then it’s the number of scaffolds assigned at this level, multiplied by their coverage if the “estimated gene copies” option was selected.
Q: Can you do this analysis at the genus level?
A: Yes, you can. Just keep in mind that there will be fewer contigs with taxonomy assignments at the genus level than at higher levels, meaning that the results may be skewed (e.g. if you’re comparing communities with many populations that don’t have close relatives in the isolate genome reference database that is used for scaffold lineage prediction).
Q: Can we perform these analyses with centered log-ratio transformations?
A: You can’t do this directly in IMG. However, the “Full results” file you download will include the input matrix of feature counts, which you could then use in your favorite statistical analysis software to try other transformations and/or tests.
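As an example of what such downstream processing might look like, below is a minimal sketch of a centered log-ratio (CLR) transform applied to one sample's feature counts. The layout of the “Full results” file is not described here, so the counts are passed in as a plain list; the function name and the pseudocount of 0.5 (one common way of handling zero counts) are assumptions.

```python
import math

def clr_transform(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's feature counts.

    Each count is log-transformed (after adding a pseudocount so zeros
    are defined) and centered on the mean log value of the sample.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

# Hypothetical counts of four features in one sample.
sample_counts = [10, 0, 5, 85]
clr_values = clr_transform(sample_counts)
print([round(v, 3) for v in clr_values])
```

The transformed values of a sample sum to zero, and features more abundant than the sample's geometric mean get positive values. Applying the function to every sample column of the count matrix gives a table that can be passed to the test of your choice.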
Q: Is it possible to restrict features to a specific taxon? For example, can I compare KOs between my groups of samples only for scaffolds annotated as Archaea?
A: This feature is being beta-tested at present. We hope to make this type of comparison possible in the near future!
Q: So what if I am interested in organisms that are not present in the IMG database, such as nematodes or fungi?
A: IMG is very prokaryote-centric. We don’t load eukaryotic genomes other than as a reference for metagenome analysis.
Q: KEGG mentioned that using their resources requires a licence. Does this hold when using them through IMG?
A: IMG has a KEGG licence which enables us to provide it to our users, so you don’t have to worry about having a licence on your side.
Q: Once approved, do you know how long it would take to annotate an assembled metagenome?
A: A rule of thumb would be a few weeks for an average-sized submission. However, it could also take several months, depending on the current load of the system, the size of the backlog, and the size of your dataset.
Q: How efficient is this stats tool for comparing novel species that may contain novel functions not represented in these functional annotation databases (like KO, Pfam, COG, Tigrfam, etc)?
A: For highly divergent or novel groups, there is a possibility of missing out on additional differentially abundant conserved hypothetical proteins that are not captured by any of these databases. On average, about 80% of total CDS of bacterial genomes are assigned to a function annotation database.
Q: How does assembly impact the relative gene/function counts relative to unassembled metagenomes?
A: That is what the “estimated gene copies” measurement tries to address, by multiplying the feature count by the average read depth of the scaffold. However, please make sure that all the samples in all your groups do have the coverage information needed to calculate estimated gene copies. If even one sample in your input is missing this information, the Stats tool will report an error. While most of the JGI-sequenced metagenomes do have this information, many others do not.
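The calculation described above can be sketched in a few lines. This is an illustration only: the record layout (`feature_count`, `avg_read_depth`), the function name, and the numbers are hypothetical, and the error raised here merely mimics the behavior of the tool when coverage is missing.

```python
def estimated_gene_copies(scaffolds):
    """Sum feature counts weighted by each scaffold's average read depth.

    scaffolds is a list of dicts with 'feature_count' and
    'avg_read_depth' (None or absent when coverage is unavailable).
    """
    total = 0.0
    for scaffold in scaffolds:
        depth = scaffold.get("avg_read_depth")
        if depth is None:
            # Mirror the tool's behavior: refuse to compute without coverage.
            raise ValueError("coverage information is missing for a scaffold")
        total += scaffold["feature_count"] * depth
    return total

# Hypothetical sample: one feature found on two scaffolds.
sample = [
    {"feature_count": 2, "avg_read_depth": 10.0},
    {"feature_count": 1, "avg_read_depth": 3.5},
]
print(estimated_gene_copies(sample))  # 2 * 10.0 + 1 * 3.5 = 23.5
```

Without the weighting, the feature would simply be counted 3 times; with it, the two copies on the deeply covered scaffold contribute most of the total.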
Q: Can you graphically summarize the results of the stat analysis in this set-up such as the ternary plot example from Vigneron et al 2017 Sci Rep.?
A: For advanced visualization, we recommend downloading the full results set and using external statistical software (e.g. R).