IMG Webinar: Sequence Similarity Searches

Q(uestion): Is it possible to have access to the first part of this Webinar series?
A(nswer): Yes, IMG conducted a pilot webinar series with four topics – these individual webinar recordings can be found here in the “PAST WEBINARS” section.

Q: Is it free to use the Workspace?
A: Yes, it’s free. However, Workspace is only available in the IMG/MER version, for which you need to request an IMG account; the account is free too. All IMG tools and services are free to users.

Q: Are there any IMG tools to create consensus sequences?
A: Sort of. You can generate a Clustal-based alignment of genes (nucleotide or protein) residing in your gene cart using the “Sequence Alignment” tab, and then click “consensus” in the resulting alignment viewer. However, this is not the ideal strategy. We recommend using alternative tools that are designed explicitly for this purpose and provide a full range of functionality for editing and creating multiple sequence alignments and consensus sequences. If you’re interested in online resources in particular, we recommend: http://www.phylogeny.fr/index.cgi

Q: What is the difference between BLAST and PSI-BLAST?
A: PSI-BLAST uses position-specific scoring matrices (PSSMs) to score matches between query and database sequences. It assigns higher weights and larger scores to positions that are highly conserved in the alignments to multiple subject sequences. Searching with a PSSM can detect hits to more distant sequences (<20% sequence identity), as long as they share these conserved positions. In contrast, BLAST uses position-independent scores, i.e. the same scores (taken from a scoring matrix such as BLOSUM62) for all positions in the alignment, regardless of whether they are highly conserved or highly variable. Please read further: https://www.ncbi.nlm.nih.gov/books/NBK2590/

Q: In a BLAST pairwise protein alignment, what does the + represent?
A: The “+” symbol denotes a similar “chemical property”: the two amino acids are in the same “class” (the pair has a positive score in the scoring matrix) and can possibly replace each other without affecting the overall function/structure of the protein.
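
As a rough illustration of how that middle line of a pairwise alignment is built, here is a minimal sketch. It assumes the usual BLAST convention (identical residues are repeated, non-identical pairs with a positive substitution score get “+”, everything else is blank). The small SCORES table holds only a handful of BLOSUM62-style entries chosen for the example, and the function names are hypothetical.

```python
# Illustrative subset of substitution scores (BLOSUM62-style values).
# Assumption: a real analysis would load the full matrix used by the search.
SCORES = {
    ("I", "L"): 2,
    ("D", "E"): 2,
    ("K", "R"): 2,
    ("S", "T"): 1,
    ("A", "W"): -3,
    ("G", "P"): -2,
}


def pair_score(a, b):
    """Look up a score in either order; pairs missing from the toy table count as negative."""
    return SCORES.get((a, b), SCORES.get((b, a), -1))


def midline(query, subject):
    """Build the line shown between two aligned sequences of equal length."""
    out = []
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            out.append(" ")   # gap in either sequence
        elif q == s:
            out.append(q)     # identical residue
        elif pair_score(q, s) > 0:
            out.append("+")   # conservative substitution
        else:
            out.append(" ")   # non-conservative substitution
    return "".join(out)


if __name__ == "__main__":
    q = "MKIDSAG"
    s = "MRLESWP"
    print(q)
    print(midline(q, s))
    print(s)
```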

Q: Does JGI offer 16S, metagenomic, or metabolome sequencing services for human stool or oral samples?
A: No, it does not. JGI is funded by the U.S. Department of Energy (DOE), and DOE research focus areas are restricted to biogeochemical cycles, bioremediation, biofuels and similar topics. For JGI’s product portfolio, please see: https://jgi.doe.gov/our-science/product-offerings/. To gain access to these capabilities, please review our user program pages and submit a proposal.

Q: Does JGI IMG database contain data derived from human samples?
A: Yes, it does. IMG has numerous isolate genomes as well as metagenomes from human host-associated environments (for example, from the Human Microbiome Project (HMP)).

Q: If I have 250 genes in my gene cart and I want to get (using blastp) only the sequence of the top hit isolate (top blastp hit) for each gene (not all alignments, only the sequence of the best hit/alignment), is there a way to automate that search? (instead of going gene by gene)
A: You would have to break your list of query sequences into multiple batches, submit searches using Find Genes > BLAST > All Isolates, and set the “number of hits” to 1. This will produce a results table with the top hit for each query sequence. The query limit is 10,000 characters (including headers), so assuming an average protein query length of 300 aa, you can submit about 30 query sequences in each batch and repeat about 10 times to get your results for all 250.
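
The batch arithmetic can be sketched as follows. This is only an estimate under stated assumptions: the 30-character header length is a guess, and counting one newline after the header and one after the sequence toward the limit is our conservative choice, not something IMG documents here.

```python
import math

QUERY_LIMIT = 10_000  # characters per submission, headers included (from the answer above)


def estimate_batches(n_queries, avg_seq_len, header_len=30, limit=QUERY_LIMIT):
    """Return (queries_per_batch, number_of_batches) for a rough plan.

    header_len is a hypothetical average FASTA header length; two newline
    characters per record are counted as well (an assumption).
    """
    chars_per_record = header_len + 1 + avg_seq_len + 1
    per_batch = limit // chars_per_record
    if per_batch == 0:
        raise ValueError("a single average record exceeds the query limit")
    return per_batch, math.ceil(n_queries / per_batch)


if __name__ == "__main__":
    per_batch, batches = estimate_batches(250, 300)
    print(f"{per_batch} queries per batch, {batches} batches")
```

Longer headers or longer-than-average proteins reduce the number of queries that fit, so treat the result as a planning figure only.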

Q: Is there a way to find the top isolate hit with the gene neighborhoods option?
A: Not at present. One workaround (ONLY if your query gene is assigned to a COG or Pfam) would be to use Find Genes > “Cassette search”. Alternatively, you can use the “Top IMG Homologs” option to retrieve a set of homologs, add the best isolate hit (or multiple hits) to the Gene Cart and use the “Neighborhoods” tab to visualize their neighborhood conservation.

Q: What’s the difference between LAST and BLAST?
A: Citing http://last.cbrc.jp/: “The main technical innovation is that LAST finds initial matches based on their multiplicity, instead of using a fixed length (e.g. BLAST uses 11-mers). To find these variable-length matches, it uses a suffix array (inspired by Vmatch). To achieve high sensitivity, it uses a spaced suffix array (or subset suffix array), analogous to spaced seeds (or subset seeds)”

Q: How did you get to the Advanced Genome Search page?
A: https://img.jgi.doe.gov/cgi-bin/mer/main.cgi?section=GenomeSearch&page=searchForm

Q: How do you see the list of pre-loaded public advanced genome search queries?
A: Public list is here: https://img.jgi.doe.gov/cgi-bin/mer/main.cgi?section=Workspace&page=public_genome_search_history

Q: Does registration allow for analysis of >1000 bacterial genomes? If yes, is there an upper limit on the number of genomes that can be analysed in parallel?
A: There are various comparative genomic tools in IMG with varying “limits”, some of which can be extended by using Workspace functionality. If you mean a BLAST search of 1000 genomes: “All Isolates” BLAST runs against the entire IMG database, which is significantly larger than 1000 genomes. However, there are limits on the size of the query (up to 10,000 characters including the headers, so multiple query sequences may be submitted) as well as on the number of hits (<=500), i.e. you may not be able to see hits from all genomes. If you need an exhaustive search to find hits in each and every genome in your set, you should use Find Genes -> BLAST -> Selected Isolates (up to 100 genomes/metagenomes in one batch) OR Workspace -> Genome Sets -> BLAST (up to 500 genomes/metagenomes in one batch); in this case the limit of <=500 applies to each individual genome/metagenome, not the total count of hits displayed.

Q: On what basis was Cultivation metadata selected?
A: Please view this webinar on Advanced Genome Search that explains functionality and organization of underlying metadata.

Q: Can we upload and analyse our own genomes or metagenomes using IMG?
A: Yes, please visit our submission page and follow the guidelines. You can also view our Data Submission and Management webinar.

Q: Does IMG have all the prokaryotic sequences that are available at NCBI?
A: No, not all NCBI data. There is a small backlog at present. Also, our objective is to have a broad diversity represented, so not every available strain of every genus and species will be loaded (especially for cases like Mycobacterium tuberculosis or Staphylococcus aureus with 1000s of available strains), but at least one strain of every available species should be included. Please contact us if you find that specific genomes are missing and would like to see them added to IMG.

Q: So IMG has what JGI sequences plus data from external submitters?
A: Yes, IMG contains all JGI-generated datasets, as well as data from external submitters and NCBI. For isolate genomes, as stated above, we do not import all public data from NCBI (e.g. IMG has only a small subset of strains from human pathogens). For metagenomes as well, there is a large number of non-JGI-generated datasets (e.g., HMP, TARA Oceans), but IMG does not contain every dataset in the public domain.

Q: Does IMG do metagenomic binning?
A: This is provided as a service for JGI sequenced projects primarily. At present, we have auto-generated 89,672 bins from 11,061 public metagenomes. Please visit our Metagenome Bins page and view the webinar on this topic.

Q: Are the databases available for running a command line BLAST? Does IMG have an FTP resource?
A: All JGI-generated datasets in IMG can be downloaded in bulk using the Genome Portals pages. However, the downloads include only fasta files, from which BLAST databases can be generated, not the databases themselves. We don’t provide BLAST databases for download, since generating them is trivial if you have a fasta file.
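
For readers who have not built a local database before, here is a minimal sketch of the step mentioned above, assuming NCBI BLAST+ is installed and its makeblastdb program is on the PATH. The file and database names are hypothetical placeholders, and the helper function is ours, not part of IMG or BLAST+.

```python
import os
import shutil
import subprocess


def makeblastdb_command(fasta_path, dbtype, out_prefix):
    """Assemble the makeblastdb call for a FASTA file.

    dbtype is 'prot' for protein FASTA or 'nucl' for nucleotide FASTA.
    """
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", out_prefix]


if __name__ == "__main__":
    fasta = "downloaded_genomes.faa"  # hypothetical file from the Genome Portal
    cmd = makeblastdb_command(fasta, "prot", "downloaded_genomes_db")
    print(" ".join(cmd))
    # Only run the tool when it is installed and the input file exists.
    if shutil.which("makeblastdb") and os.path.exists(fasta):
        subprocess.run(cmd, check=True)
```

The resulting database prefix can then be passed to the command-line search programs (for example blastp) through their -db option.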

Q: Do we prefer draft genomes or complete sequences when running a similarity search?
A: It depends on your objective. Since draft genomes could have physical or sequencing gaps, “absence of evidence” in such a genome does not confirm “evidence of absence”: a missing hit may simply fall in a gap.

Q: Do you have a database of non-coding RNAs?
A: Yes, we have databases for non-coding RNAs. For genomes we have a combined database for all non-coding RNAs (all rRNAs, tRNAs, as well as others, such as tmRNA, RNase P RNA component, etc.) and a 16S rRNA database. For metagenomes we have specialized databases for 5S, 16S, 23S, 18S, 28S and other non-coding RNAs (including tRNAs, tmRNAs, etc.). These are generated by collecting the respective sequences from assembled metagenomes and metatranscriptomes.

Q: Do you allow submission of MAGs?
A: In principle, we do accept submission of MAGs in 2 ways: a small set of high-quality MAGs can be submitted as genomes, while for large sets of variable quality we prefer that you submit a metagenome and these MAGs as bins (see IMG Metagenome Bins as an example of this format). Please contact us via “Contact Us” link in Help section for additional information.

Q: What is the difference between IMG search and NCBI BLAST?
A: The IMG Gene Search by BLAST uses the same BLAST program as GenBank, but the database that is searched is different. Differences in results arise from the different content of IMG versus GenBank.

Q: Also, could we use that tool for plant genomes?
A: While plant genome sequences are included in IMG, they may not be completely up to date with all public genomes. JGI has the Phytozome portal dedicated to plant genomic analyses.

Q: Are the TARA oceans metagenomes and metatranscriptomes in the IMG database?
A: Yes, they are.

Q: In IMG, can we work on 18S rRNA metagenome amplicon analysis?
A: IMG does not contain amplicon data. However, 16S or 18S rDNA genes are predicted (to the extent possible) from metagenome assemblies, and can be analyzed further or downloaded. In addition, you can run BLAST against the IMG collection of 18S sequences collected from metagenomes using Find Genes -> BLAST -> RNA. You can use a multi-fasta file with multiple sequences as a query, but keep in mind that the query size is limited to 10,000 characters.

Q: What is the minimum cutoff for 16S percent identity within the same species?
A: The species %identity cutoff reported in many publications is 97%, but that is not a hard and fast rule.

Q: Does it provide sync to R stats?
A: We are not sure we understand the question. BLAST results can be exported as tab-delimited files, and then used with any software package that understands the tab-delimited format. IMG does not provide BLAST via API calls.
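
As an illustration of working with an exported tab-delimited file outside IMG, here is a minimal sketch that keeps the highest-scoring row per query. The column names (“Query”, “Subject”, “Bit Score”) are assumptions made for the example; check the header row of your own export and pass the matching names.

```python
import csv
import io


def best_hit_per_query(handle, query_col="Query", score_col="Bit Score"):
    """Return {query_id: row} keeping the row with the highest score per query.

    handle is any open text file (or file-like object) with a header row.
    """
    best = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        query = row[query_col]
        score = float(row[score_col])
        if query not in best or score > float(best[query][score_col]):
            best[query] = row
    return best


if __name__ == "__main__":
    # Tiny made-up table standing in for an exported results file.
    sample = (
        "Query\tSubject\tBit Score\n"
        "gene1\thitA\t50\n"
        "gene1\thitB\t80\n"
        "gene2\thitC\t40\n"
    )
    for query, row in best_hit_per_query(io.StringIO(sample)).items():
        print(query, row["Subject"], row["Bit Score"])
```

The same file can be read in R or a spreadsheet program; nothing here is specific to Python.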

Q: Can we compare more than 3 genomes on this platform?
A: Yes you can. A variety of options are available under “Compare Genomes” and elsewhere. Please stay tuned for upcoming webinars.

Q: When I BLAST a non-16S sequence query against a metagenome, I often wonder what correct thresholds of % identity are appropriate, or even if this question is appropriate.
A: If you mean thresholds for taxonomic assignments, please review: https://academic.oup.com/nar/article/42/8/e73/1076763

Q: Can we search for a specific sequence against multiple metagenomes via BLASTP?
A: Yes. Add your metagenomes to the cart. Go to Find Genes > BLAST > Selected Genomes to launch your BLAST job. Only 100 genomes/metagenomes can be searched at one time. If you want to search up to 500 metagenomes at once, use the Genome Set BLAST functionality in the Workspace.

Q: While filtering a table, can you do a negative filter (not plants)?
A: Yes, table filters do allow regular expression searches – please review this webinar at minute 26:39 on how to do this. Also more details were provided in the “Tips and Tricks” document linked to the Q&A transcript of that webinar.
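
For readers unfamiliar with “negative” regular expressions, here is a minimal sketch of the general idea using a negative lookahead. It only demonstrates the pattern in Python; whether the IMG table filter accepts exactly this syntax is an assumption, so please follow the webinar and the “Tips and Tricks” document for the form IMG expects. The row values are made up.

```python
import re

# Matches any string that does NOT contain "plant" (case-insensitive).
NOT_PLANTS = re.compile(r"^(?!.*plant).*$", re.IGNORECASE)


def keep_non_plant(rows):
    """Return only the rows whose text does not mention plants."""
    return [row for row in rows if NOT_PLANTS.match(row)]


if __name__ == "__main__":
    rows = [
        "Host-associated; Plants; Rhizosphere",
        "Environmental; Aquatic; Marine",
        "Host-associated; Human; Digestive system",
    ]
    for row in keep_non_plant(rows):
        print(row)
```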

Q: Does finding hits from an aquatic metagenome really mean that the bacteria live in this environment? Or could it be DNA washed from the soil and then recovered from water?
A: Yes, sporadic occurrences are always a possibility, even though the sequences that end up in IMG metagenomic rRNA databases are usually those of relatively abundant lineages. But it would be better to assert a claim based on multiple instances from multiple datasets or studies.

Q: I have my set of genomes saved and I want to find a specific gene, but I only have its function. So I went to the “find function” tool and searched by gene product name. I got a table with all the hits throughout my genomes. Is there a way to download the alignment and also see the gene neighbors for all genomes I am working with?
A: To view an alignment of your genes, you need to add them all to the gene cart and use the “Sequence alignment” tab to generate one. For the gene neighborhood, see the answer to the next question (below).

Q: Can we run the same kind of homolog search as this NifH gene search, but using a gene cluster (neighbor genes)?
A: It depends on what kind of information you expect to find as a result of such a search. If you want to see neighborhood conservation (i.e. gene order, orientation, and functions), you can use “Top IMG Homologs” to find the best hits of your query gene, then add them to the Gene Cart and use the “Neighborhood” tab to review the conservation of chromosomal neighborhoods. This viewer doesn’t show best hits; instead it shows the assignment of neighborhood genes to protein families (specifically COGs). If you want to make sure that the neighborhood genes have hits in the same genomes within the same % identity range, you can use Find Genes -> BLAST -> Selected Genomes or Workspace -> Genome Sets -> BLAST with the sequences of several neighbor genes as a query. On the other hand, if you are looking for all genomes in which two genes of interest are found next to each other (or within a certain distance of each other), we are in the process of developing a “Cluster scout” tool to find conserved clusters of genes. At present we have Find Genes > “Cassette search”, for which you would need to use the COG or Pfam (for NifH and its neighbor).

Q: A question just out of curiosity. Have you collaborated with NCBI (or any other database) to get all these sequences and information?
A: No, we don’t collaborate with NCBI on this particular task. We do collaborate with NCBI on the issues of taxonomy and genome submission. In addition, we import NCBI isolate genomes and process public data from SRA to include in IMG (when we have the bandwidth).

Q: When %identity is high but with a small coverage there is the possibility of only a conserved shared domain. Is filtering with both %identity and coverage possible?
A: Yes, you can filter the BLAST results table using alignment length – however you cannot specify a minimum alignment length in the BLAST input page, since this is not a default parameter that can be included in BLAST search.
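
Once the results table is exported, both criteria can also be applied offline. The sketch below is one possible way to do it, with assumptions worth stating: coverage is approximated as alignment length divided by query length (alignment length includes gaps, so this slightly overestimates), the dictionary keys are hypothetical names, and the thresholds are arbitrary examples rather than recommended values.

```python
def query_coverage(hit):
    """Approximate fraction of the query covered by the alignment."""
    return hit["align_len"] / hit["query_len"]


def filter_hits(hits, min_identity=90.0, min_coverage=0.8):
    """Keep hits that satisfy BOTH the identity and the coverage threshold."""
    return [
        hit for hit in hits
        if hit["pct_identity"] >= min_identity and query_coverage(hit) >= min_coverage
    ]


if __name__ == "__main__":
    hits = [
        # High identity over a short stretch: likely just a shared domain.
        {"subject": "hitA", "pct_identity": 98.0, "align_len": 60, "query_len": 300},
        # High identity over most of the query.
        {"subject": "hitB", "pct_identity": 92.0, "align_len": 280, "query_len": 300},
        # Long alignment but low identity.
        {"subject": "hitC", "pct_identity": 45.0, "align_len": 290, "query_len": 300},
    ]
    for hit in filter_hits(hits):
        print(hit["subject"])
```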

Q: Hello, I have NirK sequences extracted from my metagenome and I would like to determine the taxonomy for each gene. I guess I would use blastp, but I have hundreds of sequences. Is it possible to determine taxonomy in IMG, or do I have to download the sequences and classify them via some other program?
A: We assume your metagenome has not been submitted to IMG. You have several options: 1) If your metagenome is assembled, you can submit it to IMG, in which case IMG will generate the best LAST hits for all proteins in your metagenome including NirK, as well as attempt to assign the scaffold “lineage” (i.e. predicted taxonomy) to all scaffolds, including the scaffolds on which NirK is found. 2) Alternatively, you can run multiple queries using Find Genes -> BLAST -> All isolates. The query is limited to 10,000 characters, so you can submit multiple sequences as multi-fasta. That said, even for deeply sequenced, assembled metagenomes, we haven’t seen instances of hundreds of NirK genes. Are you sure it’s not amplicon data or unassembled WGS sequences? If the latter, taxonomic assignment won’t be terribly accurate, so it may be better to try and assemble the data. The correct way is to curate a reference set of established, closely related NirK sequences, then align them with your metagenome candidates and build a tree.

Q: Will we be receiving emails for the next webinars?
A: Yes, you will.

Q: When looking for a gene product, if I’m not sure about the name (nodule or nodulation perhaps) can I type nodul# to allow anything starting with nodul?
A: Yes, every search term is treated as a partial search – there is no need for “#” or anything else.

Q: Is there a way to get the exported fasta file as a download and not just as text in a webpage?
A: Yes, save the genes to a Workspace gene set and export the genes from there.

Q: Is there a recommended tool to assess phylogenetic diversity or similarity between all predicted BGCs within a collection of bacteria, e.g. a genus or species of bacteria?
A: Please explore tools and content in IMG-ABC, a data mart dedicated to biosynthetic gene clusters (BGC) for secondary metabolites.

Q: How to blast with multiple queries?
A: You can submit more than one sequence in the query window. However, there is a 10,000 character limit on the total length of the query. If you want to run BLAST with multiple sequences totaling more than 10,000 characters (including fasta headers), you may have to split it into 10,000 character chunks.
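
Splitting a large multi-fasta file by hand is tedious, so here is a minimal sketch of one way to do it. It packs whole records into chunks that stay within the 10,000-character limit. Counting newline characters toward the limit is a conservative assumption on our part, and the function names are hypothetical.

```python
def read_fasta(text):
    """Parse multi-fasta text into a list of (header, sequence) tuples."""
    records, header, seq = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line, []
        elif header is not None:
            seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))
    return records


def split_into_batches(records, limit=10_000):
    """Group records into fasta strings of at most `limit` characters each."""
    batches, current, size = [], [], 0
    for header, seq in records:
        entry = f"{header}\n{seq}\n"
        if len(entry) > limit:
            raise ValueError(f"record {header} alone exceeds the limit")
        if current and size + len(entry) > limit:
            batches.append("".join(current))  # close the full batch
            current, size = [], 0
        current.append(entry)
        size += len(entry)
    if current:
        batches.append("".join(current))
    return batches


if __name__ == "__main__":
    # Made-up input: 250 proteins of 300 residues each.
    fasta = "".join(f">gene{i}\n{'A' * 300}\n" for i in range(250))
    batches = split_into_batches(read_fasta(fasta))
    print(len(batches), "batches; largest has", max(map(len, batches)), "characters")
```

Each returned string can be pasted into the query window as one submission.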

Q: Can we get a certificate of participation for these webinars?
A: No, sorry, we don’t give certificates for webinars. But if you attend an in-person MGM workshop at JGI, you will get a certificate.