IMG Webinar: ANI (Average Nucleotide Identity)

Q: Does ANI take into account plasmid genes or only chromosomal genes?

A: The intention is to use only chromosomal genes, since plasmids can be recently acquired. For complete or finished genomes, where the plasmids are known or discernible, plasmid genes are excluded. However, since the majority of isolate genomes are draft genomes in which plasmids are not delineated, all genes, including plasmid ones, are used for ANI computation in those cases.

Q: What is BBH again?

A: Bidirectional best hit (also called reciprocal best hit): a pair of genes, one from each genome, that are each other's best match.
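As a rough illustration of the BBH idea, here is a minimal sketch. The input format (a dictionary of alignment scores keyed by gene pair) and the function names are hypothetical, and this is not IMG's actual pipeline.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    `scores` is a dict {(query, subject): score}; a hypothetical input format.
    On ties, the first pair encountered wins.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(g1_vs_g2, g2_vs_g1):
    """Return (gene1, gene2) pairs that are each other's best hit."""
    forward = best_hits(g1_vs_g2)
    reverse = best_hits(g2_vs_g1)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy scores: a3's best hit is b2, but b2's best hit is a2, so a3 has no BBH.
g1_vs_g2 = {("a1", "b1"): 950, ("a1", "b2"): 400, ("a2", "b2"): 700, ("a3", "b2"): 650}
g2_vs_g1 = {("b1", "a1"): 940, ("b2", "a2"): 710, ("b2", "a3"): 640}
pairs = bidirectional_best_hits(g1_vs_g2, g2_vs_g1)
print(pairs)
```

Requiring the hit in both directions is what filters out likely paralogs such as a3 in the toy data.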

Q: Can we completely rely on ANI and dDDH (digital DNA-DNA hybridization) values for novel species description?

A: Strictly speaking, in order to define a new species, you would still be required to do DDH (DNA-DNA hybridization), not dDDH or ANI, which are in silico methods. That’s in addition to the required chemotaxonomic and phenotypic characterization and 16S rRNA sequencing. However, ANI cutoffs are highly correlated with DDH values, as described in this publication.

Q: Is it possible to do the comparison with incomplete or draft genomes?

A: Yes, it is. Many of the draft genomes in IMG are incomplete. However, the degree of incompleteness matters, because it can affect the AF (alignment fraction) thresholds. For example, if you’re comparing two highly similar genomes where one is only 50% complete (this would be an exception, of course; few isolate genomes in IMG are this poor), the ANI may be high in both directions (G1->G2 and G2->G1), but the AF will be low for the more complete genome. Also, if one genome is highly fragmented (has too many contigs), the method will have problems finding BBHs (bidirectional best hits) for fragmented genes, possibly leading to incorrect results.
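The 50%-complete example can be worked through with made-up numbers. This sketch assumes AF is the total length of a genome's genes that fall in BBHs divided by the total length of all its genes, and that the BBH length is the same on both sides. Both are simplifying assumptions, and all lengths are invented.

```python
def alignment_fraction(bbh_length_bp, total_cds_length_bp):
    """Fraction of a genome's coding sequence covered by BBH genes (assumed definition)."""
    return bbh_length_bp / total_cds_length_bp


g1_total_cds = 4_000_000   # hypothetical complete genome
g2_total_cds = 2_000_000   # hypothetical draft, only about 50% complete
shared_bbh = 1_900_000     # coding sequence in BBHs, assumed equal on both sides

af_g1_to_g2 = alignment_fraction(shared_bbh, g1_total_cds)
af_g2_to_g1 = alignment_fraction(shared_bbh, g2_total_cds)
print(f"AF(G1->G2) = {af_g1_to_g2:.3f}, AF(G2->G1) = {af_g2_to_g1:.3f}")
```

With these numbers the complete genome G1 falls below a 0.60 AF cutoff even though nearly all of the draft genome G2 is matched.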

Q: How robust is the described 96.5/60 ANI/AF threshold?

A: Please read the publication benchmarking the ANI/AF thresholds against 16S distances and DDH based on several thousand high quality genomes. However, you have to keep in mind that current practice in microbiology is to use polyphasic taxonomy, which uses additional information, such as chemotaxonomy, phenotypes, growth requirements, etc., to justify delineation of new species. As a result, ANI/AF may agree with 16S and DDH, but disagree with polyphasic taxonomy. This is very common among human and animal pathogens such as Brucella spp., which are nearly 100% identical over their entire genome sequence, yet are still classified into different species based on their host specificity.

Q: Would 96.5 threshold also hold for MAGs? I’ve seen 95% is used.

A: Yes, the method should be generally applicable to any genome-like objects, be it isolate genomes, MAGs or SAGs (single cell amplified genomes). However, as with draft isolate genomes, the level of completeness can affect the AF threshold, as well as the ANI.

Yes, 95% ANI is indeed typically used as a conservative cutoff to dereplicate MAGs in many published studies. However, this is a whole-genome ANI, which is based on both genes and intergenic regions, and includes both orthologous and paralogous genes. The MiSI ANI cutoff is generally higher because it considers only the nucleotide sequences of protein-coding genes (CDSs), which have to be BBHs (bidirectional best hits, i.e. likely orthologs) and are therefore more conserved even at the nucleotide level. A MiSI cutoff of 96.5 is generally equivalent to a whole-genome ANI of 95.

Q: Potentially two members of a clique group could fail to meet the ANI and AF criteria for a clique. What happens then?

A: Clique-group means that not all genomes had pairwise ANI and AF above the cutoff. These are mainly found in species with “open pangenomes” and help distinguish isolates that are possibly in the process of becoming new species.
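To make the distinction concrete, here is a minimal sketch that groups genomes linked by passing pairs and labels each group. It assumes a group is a connected set of genomes and that it counts as a clique only if every pair in it passes the cutoffs. The function names and toy data are hypothetical, and IMG's actual clustering procedure may differ.

```python
from itertools import combinations


def connected_components(genomes, linked):
    """Group genomes joined by at least one chain of passing pairs."""
    neighbors = {g: set() for g in genomes}
    for a, b in linked:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, components = set(), []
    for start in genomes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


def classify(component, linked):
    """Label a group: clique if all pairs pass, clique-group otherwise."""
    if len(component) == 1:
        return "singleton"
    all_pass = all(frozenset(p) in linked for p in combinations(component, 2))
    return "clique" if all_pass else "clique-group"


genomes = ["A", "B", "C", "D", "E", "F"]
# Pairs that pass both the ANI and AF cutoffs (toy data); D-F does not pass.
linked = {frozenset(p) for p in [("A", "B"), ("B", "C"), ("A", "C"), ("D", "E"), ("E", "F")]}
labels = {tuple(c): classify(c, linked) for c in connected_components(genomes, linked)}
print(labels)
```

In the toy data, D and F are connected only through E, so that group is reported as a clique-group rather than a clique.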

Q: Can you remind people what services will be affected by your outage July 9-14?

A: Unfortunately, all services will be down, although we can’t tell for how long within that period, might not be the entire time.

Q: When we calculate ANI, is it always between only two genomes? Or can I do this calculation between databases?

A: Yes, ANI computation is always between 2 genomes. It doesn’t do genome-wide multiple sequence alignment. You can select a set of genomes for pairwise computations between them, and then find clusters (cliques and clique groups as described by Neha in the presentation). Also see: https://img.jgi.doe.gov/cgi-bin/mer/main.cgi?section=ANI

Q: Is there any ANI value for genus level demarcation?

A: We didn’t benchmark ANI for a genus level demarcation, because we found that it’s too variable. Average Amino acid Identity or AAI methods do provide some benchmarks – please review: http://enve-omics.ce.gatech.edu/mytaxa/

Q: When doing a same species plot, the repeated (or duplicated) genomes are also taken into account, is there any way to sub-select specific genomes?

A: Yes, this is possible using the Pairwise ANI tool: add the subset of genomes to your genome cart and make the appropriate selections in the input windows. In the same species plot, clicking on a circle opens a pop-up listing the genomes in the pairs that the circle represents; these can then be selected and added to your genome cart.

Q: Are webinars stored so we can go back to the explanations some other time? If so, where are they stored?

A: All webinars are recorded, and we will post the links soon. They will all appear in this YouTube channel. If you have additional questions about IMG, please contact us.

Q: Do GenBank and RefSeq use ANI for species annotation as well?

A: Yes, they do use a similar method, although for details you’d have to contact them directly.

Q: Can you upload genome assemblies that are not in JGI and use the ANI tool? For example, there are only 12 genomes for Bacteroides vulgatus in JGI. There are many more in NCBI.

A: There is an “upload file” option in the pairwise ANI tool. You can use it to submit a multi-FASTA file of nucleotide sequences of protein-coding genes for your other genomes. However, multiple genomes will need to be uploaded one at a time.

We do import NCBI genomes to IMG periodically; however, there is currently a backlog. Also, the intention is to capture the breadth or diversity of all available sequenced genomes, but not the depth (as you’re experiencing). That being said, if there are any particular NCBI genomes you wish for us to import, you can submit a ticket here: https://img.jgi.doe.gov/cgi-bin/mer/main.cgi?section=Questions

Q: Does ANI answer questions related to Synteny?

A: ANI is independent from synteny. IMG does have synteny viewers available under the Compare Genomes menu.

Q: Does IMG import metagenomes from NCBI also?

A: No, we only import isolate genomes from NCBI, though we import some metagenomes from SRA, assemble, and upload.

Q: OK, how do I find a metagenome (submitted to SRA) in IMG?

A: You can search the GOLD database (gold.jgi.doe.gov) using fields such as SRA experiment title, or both IMG and GOLD using identifiers such as NCBI BioSample and BioProject, to find the metagenomes from SRA that you’re interested in. If you find no hits, it most likely means that the corresponding metagenome is not in IMG. As a general policy, we don’t process human-related metagenomes, with the exception of those generated by the Human Microbiome Project. However, you may assemble them yourself and submit them to IMG for annotation.

Q: Please send me the link to submit my metagenome SRA data.

A: Please keep in mind that IMG only accepts assembled metagenome contigs and not reads. Please check our data submission site and webinar on this topic.

Q: In the pairwise ANI tool I saw a limit of 100 genomes. Can I increase this number?

A: Sorry, no, 100 is a hard limit because of limited resources to calculate ANI data on the fly.

Q: Is there a significant speed improvement when submitting jobs outside of high-use hours (e.g. submitting late at night)?

A: We have users from all over the world, so there is no difference. It doesn’t really matter when the computation request is submitted, since IMG will run it in the background and choose a time of low use. However, on-the-fly searches may be slower during high-use hours.

Q: I see some of the comparisons (of Bradyrhizobium japonicum) do not meet the 96.5 ANI threshold – but they are named as the same species? Can we infer that the 20 B. japonicum genomes shown in the plot belong to different species and not just one?

A: Yes, good catch. There are many reasons for this. It is quite possible that the strain is misclassified when it is sequenced; its taxonomy may or may not be revised at a later stage. IMG gets strain taxonomy from NCBI via GOLD; there may be a lag in updating NCBI taxonomy, in updating GOLD taxonomy or in updating IMG taxonomy, so the latest taxonomic assignment may not be reflected in the UI. In addition, the landing page of the “Same Species” tool with the counts of genomes and clusters is updated on a monthly basis, so it may not be in sync with the latest IMG taxonomy.

In the case of B. japonicum in particular, many strains were classified as B. japonicum based on their nearly identical 16S sequences. However, it has long been recognized that 16S divergence in Bradyrhizobium spp. is low and does not correlate with whole-genome divergence (especially due to the presence of multiple plasmids and megaplasmids). As a result, strains were classified originally as B. japonicum, but ANI (and alternative phylogenetic markers that are less conserved than 16S) suggests these are more divergent and supports their possible reclassification into different species. For some strains this may have already happened, but the update has not been reflected in the UI, so in the cases of outliers it is always advisable to check their taxonomy in all possible sources (IMG, GOLD, NCBI). The Genome Details page of the respective genome in IMG provides the links to navigate to all these resources.

Q: Does IMG have a set of type (well characterized) strains of a genus that can be used for ANI instead of searching all genomes in the database?

A: Yes, IMG does have a field that specifies whether the genome is that of a type strain or not. You can use Advanced Genome Search to find them using “Cultivation Metadata” – Type Strain option. Please see the very first query in the “public list” of queries that utilizes this field. Or you can reconfigure an existing genome table by selecting “Type strain” from the list of metadata options that appear below the table.

Q: Is it possible to download the complete JGI annotated genomes data?

A: You can download JGI-annotated genomes with JGI Data Portal. Please see our previous IMG webinar about Data export and download. Also read the Q&A for additional details.

Q: How do I interpret the ANI value? When it’s higher, is it more similar?

A: Yes, that’s correct – higher ANI AND AF (alignment fraction) suggests more similar genomes and more closely related strains. Please note that the AF has to be high as well (>=60% is the minimum threshold for species level cluster delineation). If only ANI is very high, but the AF is very low – an extreme example could be 99% ANI and 10% AF – this would not be interpreted as a “highly similar” genome.
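The rule above can be written as a small check. This is only a sketch using the 96.5 ANI and 60% AF cutoffs mentioned in this Q&A; the function name is hypothetical, and ANI is expressed in percent while AF is expressed as a fraction.

```python
ANI_THRESHOLD = 96.5   # percent, species-level cutoff quoted in this Q&A
AF_THRESHOLD = 0.60    # alignment fraction, minimum for species-level clusters


def same_species_candidate(ani, af):
    """Return True only if BOTH the ANI and the AF meet their cutoffs."""
    return ani >= ANI_THRESHOLD and af >= AF_THRESHOLD


# The extreme example from the answer: very high ANI, very low AF.
print(same_species_candidate(99.0, 0.10))   # high identity over a tiny shared fraction
print(same_species_candidate(97.2, 0.85))   # high identity over most of the genome
```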

Q: How is the MiSI method different from BLAST/ Mummer to calculate ANI value? Is there any difference in threshold value using different methods to calculate ANI value?

A: There are a few things to consider. First, different aligners use different (fixed or variable, as in LAST) word lengths to seed the alignments, as well as different gap creation and gap extension penalties and alignment dropoff scores, which result in different sensitivity. For instance, blastn has a fixed word length of 11 by default, MUMmer has a word length of 20, while megablast (one of the nucleotide BLAST options) has a word length of 28, making MUMmer less sensitive than blastn and megablast even less sensitive than MUMmer. This generally means that aligners that use longer word lengths will find only nearly identical matches between genomes, while blastn will find more divergent sequences with lower % identity. As a result, the specific cutoff for within-species ANI will vary slightly depending on the aligner (including the specific BLAST option: blastn vs megablast vs discontiguous megablast) used for ANI computation.
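A back-of-the-envelope sketch of why longer seeds are less sensitive: if mismatches are assumed to be independent and uniformly spread (a strong simplification that real sequences do not obey), the chance that a given window of w consecutive bases is mismatch-free at identity p is p^w. The function name is hypothetical.

```python
def seed_survival_probability(identity, word_length):
    """Chance that one window of `word_length` bases has no mismatch.

    Assumes independent, uniformly distributed mismatches (a simplification).
    """
    return identity ** word_length


# Word lengths quoted in the answer above, evaluated at two identity levels.
for identity in (0.95, 0.80):
    for name, w in [("blastn", 11), ("MUMmer", 20), ("megablast", 28)]:
        p = seed_survival_probability(identity, w)
        print(f"identity={identity:.2f} {name:10s} w={w:2d} P(exact seed)={p:.4f}")
```

Under this toy model the gap between short and long seeds widens quickly as identity drops, which is consistent with longer-seed aligners finding mostly near-identical matches.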

Another issue is that some ANI computations rely on whole-genome sequences including both genes and intergenic regions, while MiSI uses only nucleotide sequences of protein-coding genes (CDSs). The latter are generally more conserved than intergenic regions. Furthermore, the CDSs on which MiSI ANI is computed have to be BBHs (bidirectional best hits, i.e. likely orthologs), which means that they have higher conservation even at nucleotide level, resulting in higher ANI values than those calculated from whole genome sequences with or without fragmentation.

Overall, there are slight differences between ANI values computed using different methods, but they generally seem to be within 1-2% of each other.
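For readers who want to see the gene-based calculation spelled out, here is a minimal sketch. It assumes ANI is the alignment-length-weighted mean identity over BBH gene pairs and AF is the summed length of BBH genes divided by the total length of all genes in the genome. These definitions, the tuple layout and the numbers are illustrative assumptions, not the published MiSI implementation.

```python
def ani_and_af(bbh_alignments, total_cds_length):
    """Compute one-directional ANI (percent) and AF (fraction).

    bbh_alignments: list of (percent_identity, alignment_length, gene_length)
    tuples, one per BBH gene of the query genome (hypothetical layout).
    total_cds_length: summed length of ALL genes in the query genome.
    """
    weighted_identity = sum(ident * aln for ident, aln, _ in bbh_alignments)
    aligned_length = sum(aln for _, aln, _ in bbh_alignments)
    bbh_gene_length = sum(gene for _, _, gene in bbh_alignments)
    ani = weighted_identity / aligned_length
    af = bbh_gene_length / total_cds_length
    return ani, af


# Toy genome: three genes with a BBH, out of 5000 bp of coding sequence in total.
bbhs = [(98.0, 900, 1000), (96.0, 1500, 1500), (99.0, 600, 700)]
ani, af = ani_and_af(bbhs, total_cds_length=5000)
print(f"ANI = {ani:.1f}%, AF = {af:.2f}")
```

Because only BBH genes enter the average, paralogs and intergenic regions never contribute, which is the reason given above for MiSI values running higher than whole-genome ANI.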

Q: Can we also calculate TETRA values in IMG?

A: We don’t calculate “TETRA” values specifically. If you’re asking about tetranucleotide frequencies, we have a Kmer analysis tool in Scaffold Cart and in Scaffold Workspace, as well as on the individual genome details page under “Scaffold consistency check”. See an example.

Q: On IMG, can we get ANI plot for same genus analysis as well?

A: No, you cannot. We do not have benchmarked thresholds for a “same genus” analysis. However, you can choose to calculate ANI for any subset of genomes using Pairwise ANI and plot those results elsewhere, for example in Excel.

Q: Does JGI have any plan for AAI (Average Amino acid Identity) analysis tools?

A: We don’t have any plan for AAI analyses for now. However, please complete our survey and make this suggestion, we can certainly consider it for future development. Also if you have other questions, please contact us.

Q: Where can we find the ANI in this species plot?

A: If you hover over a dot, the two genomes used in the pairwise computation are displayed. For ANI and AF, you would need to consult the X and Y axes; the smaller ANI and the smaller AF of each pairwise comparison are plotted. Please see a user guide.

Q: orthoANI and dDDH are two other genome-based methods used for species delineation. What is your take on these two methods in terms of how they compare with gANI for species delineation?

A: orthoANI produces similar results to ANI and MiSI, since all methods attempt to compute average genome identity. The difference is mostly in the convenience of computation: we find nucleotide sequences of protein-coding genes more reliable than the random genome fragmentation used in orthoANI. dDDH also produces similar results; the difference is that it is a whole-genome alignment method in which high-scoring segment pairs are converted into a distance, which is correlated with experimental DDH measurements.

Q: Can I use these tools only with IMG data or can I upload my own draft genomes for comparisons?

A: Most IMG tools require your genomes to be in IMG. However, the pairwise ANI tool allows you to submit your external genome (protein CDS FASTA) for comparison using the “upload file” option. If you are interested in submitting your own genomes to IMG, please check the IMG Submission webinar.

Q: Is it valid to represent ANI distance in a Euclidean metric?

A: Generally speaking, ANI is not a metric in the mathematical sense, since it is not symmetric. For two genomes, G1 and G2, the alignments for every pair of BBHs (g1->g2 and g2->g1) aren’t necessarily identical, and averaged over the entire genomes this may result in slightly different ANI values for G1->G2 vs G2->G1. In the Pairwise ANI tool IMG reports both values, while for the Same Species tool and plot it shows the lower ANI of the two.
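If you need a single symmetric value per genome pair, one option consistent with the Same Species plot is to keep the lower value from the two directions. The sketch below does this for ANI and, as an additional assumption, for AF; the function name and numbers are hypothetical.

```python
def conservative_pair(forward, reverse):
    """Collapse two directional (ANI, AF) results into one symmetric pair.

    forward: (ani, af) for G1->G2; reverse: (ani, af) for G2->G1.
    Keeps the lower ANI and the lower AF, so argument order does not matter.
    """
    return min(forward[0], reverse[0]), min(forward[1], reverse[1])


g1_to_g2 = (97.1, 0.82)   # made-up directional results
g2_to_g1 = (96.8, 0.61)
symmetric = conservative_pair(g1_to_g2, g2_to_g1)
print(symmetric)
```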

Q: Do you have a tool to generate a phylogenetic tree after ANI?

A: In general, ANI has limited dynamic range and therefore is not suitable for generating phylogenetic trees beyond genus-level groups. Even BLASTn can’t detect sequence similarity with <70% nucleotide identity, which roughly corresponds to the divergence of the most conserved protein families (such as ribosomal proteins) at family level. That said, the pairwise ANI values can be exported as an Excel file and converted into a similarity/dissimilarity/distance matrix using, for instance, tools in R. This matrix can then be used to compute a UPGMA or a neighbor-joining tree.
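The answer mentions R; the same idea is sketched here in plain Python so that it needs no extra packages. It assumes the ANI matrix has already been made symmetric (for example by taking the lower of the two directions), uses 100 - ANI as the distance, and returns the tree topology as a Newick string without branch lengths. The function name and the ANI values are made up.

```python
def upgma(names, matrix):
    """Average-linkage (UPGMA) clustering on a symmetric distance matrix.

    Returns a Newick string of the topology only (no branch lengths).
    """
    n = len(names)
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (subtree, size)
    dist = {(i, j): matrix[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)          # closest pair of clusters
        tree_i, size_i = clusters.pop(i)
        tree_j, size_j = clusters.pop(j)
        merged = {}
        for k in clusters:                      # size-weighted average distance
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            merged[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        dist = {p: d for p, d in dist.items() if i not in p and j not in p}
        dist.update(merged)
        clusters[next_id] = (f"({tree_i},{tree_j})", size_i + size_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree + ";"


names = ["G1", "G2", "G3", "G4"]
ani = [                     # symmetric ANI values in percent (toy data)
    [100.0, 99.0, 97.0, 92.0],
    [99.0, 100.0, 97.2, 91.8],
    [97.0, 97.2, 100.0, 92.4],
    [92.0, 91.8, 92.4, 100.0],
]
distances = [[100.0 - value for value in row] for row in ani]
tree = upgma(names, distances)
print(tree)
```

The resulting Newick string can be opened in any tree viewer. For a neighbor-joining tree or branch lengths, a dedicated phylogenetics package would be the more practical choice.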

For larger phylogenetic distances it is better to use a robust phylogenetic marker protein (e.g. beta or beta’ subunit of RNA polymerase), multiple sequence alignment and a maximum-likelihood or Bayesian inference tree. IMG provides tools for finding phylogenetic marker genes (e.g. based on their COG or KO term assignments), exporting their sequences and/or multiple sequence alignment (on a limited scale). Please refer to the previous IMG webinars for details.

Q: What is the difference in calculating the ANI on IMG and in other tools? The plotting and access look great. Anything else?

A: In essence all of these tools are similar and shouldn’t have major discrepancies other than those that may arise due to the differences in aligners used (like BLASTn vs LAST vs Mummer) and specific nucleotide sequences used to compute the ANI (whole genome vs chunked whole genome vs nucleotide sequences of orthologous protein-coding genes as in MiSI). We have seen differences of no more than 1-2% ANI between different methods.

Q: What would be the result of this analysis with MAGs? Would it compare similarity between just the MAGs or single genomes (genes?) within?

A: A MAG is a collection of contigs. The ANI between MAGs is the average percent identity between nucleotide sequences of orthologous (BBH) protein-coding genes encoded on these groups of contigs. It is computed the same way as ANI between two isolate genomes or between a MAG and an isolate genome.

Q: What would be the minimal ANI you would consider reliable?

A: Probably 90%. Below 90% the AF will be dropping precipitously as well.

Q: Are there any more Webinars planned?

A: We are in the planning phase, please stay tuned. We will reach out with an IMG Communique email when all is finalized.