Genebee BLAST 2.2.8 Services Help

This document was retrieved from a search engine cache. Original document address: http://www.genebee.msu.ru/blast_new/bh_faq.html
Last modified: Thu Feb 26 23:05:50 2004
Indexed: Tue Oct 2 01:14:08 2012

BLAST Frequently Asked Questions (FAQ)

Why do I get the "No Significant similarity found" message?
Why does my search timeout on the BLAST servers?
Why do I get the error message "ERROR: BLASTSetUpSearch: Unable to calculate Karlin-Altschul params, check query sequence"?
Why do I get the error message "ERROR: Blast: No Valid Letters to be indexed"?
Why do I see a string of "X"s (or "N"s) in my query sequence that I did not put there?
Is it possible to search for a motif or pattern with BLAST?
How do I perform a similarity search with a short peptide/nucleotide sequence?
Can I use BLAST to compare two or more sequences in a multiple sequence alignment?
What is low-complexity sequence?
What is the Expect (E) value?
Which BLAST program should I use?
How can I see low-similarity matches when there are many strong hits to my query sequence?
How can I limit my BLAST search based on Organism?
How can I search a batch of sequences with BLAST?
Other Resources

Q: Causes for "No significant similarity found"

Below are several reasons that a BLAST search can result in the "No significant similarity found" message. Note: You may need to use more than one of these options at the same time (example: increase the Expect value AND turn off filtering).

Short Sequences: Depending on sequence composition, a short sequence is a sequence of fewer than 20 residues.

Try increasing the Expect value.
You may also need to decrease the Word Size from the default (11 for nucleotides or 3 for proteins). You can decrease the word size using the -W option in the Other Advanced Options Box. Example: -W 2

You should also consult the "How do I perform a similarity search with a short peptide/nucleotide sequence?" section below.

Filtering: BLAST filters regions of low complexity (for a description of low complexity see What is low-complexity sequence? below). If your sequence contains large regions of low complexity, it may not produce significant hits to the database. You can turn off filtering by unchecking all flags in the "Filter" options section.

Query Format: Another reason you may see the "No Significant Similarity found" message is using the wrong type of sequence in your search.

FASTA. Check that you have the Input Data set to the correct format for your Query. For more information on FASTA format, click here.
Sequence type and Program combination. You can search with an amino acid query sequence using the blastp and tblastn programs. With nucleotide query sequences you can use blastn, blastx, and tblastx.

For more information on the BLAST programs, click here.
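
The query/program combinations above can be sketched as a small lookup table. This is only an illustration: the function name and the "protein"/"nucleotide" labels are hypothetical, and the pairing with the database type follows the usual definitions of the five BLAST programs rather than anything specific to this server.

```python
# (query type, database type) -> BLAST programs that accept that combination.
# The type labels are hypothetical names chosen for this sketch.
PROGRAMS = {
    ("protein", "protein"): ["blastp"],
    ("protein", "nucleotide"): ["tblastn"],
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("nucleotide", "protein"): ["blastx"],
}


def programs_for(query_type, db_type):
    """Return the BLAST programs usable for this query/database pairing."""
    try:
        return PROGRAMS[(query_type, db_type)]
    except KeyError:
        raise ValueError(f"unknown combination: {query_type!r} vs {db_type!r}")


if __name__ == "__main__":
    print(programs_for("protein", "nucleotide"))
    print(programs_for("nucleotide", "nucleotide"))
```

If a search returns nothing, checking the query type against this table is a quick way to rule out a mismatched program.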

Q: Why does my search timeout on the BLAST servers?

Certain combinations of BLAST searches with large sequences against large databases can cause the BLAST servers to time out. The servers limit the CPU time available to each search, which prevents sequences that generate many HSPs from hoarding server resources.

However, there are some things you can do to prevent a timeout and generate results from large sequences.

Increase the Word Size to 20 - 25. With a default Word Size of 11, the BLAST algorithm finds initial matches of 11 bases in length and begins extending them from either end. In a large sequence this can generate hundreds of initial HSPs between the query sequence and even a single large genomic sequence in the databases. Increasing the Word Size to 25 requires a longer initial match, which limits the number of small initial fragments to be extended.
Decrease the Expect value to 1.0 or lower. Many hits from large sequences are to small fragments in the database. These hits tend to have high Expect values, so lowering the Expect threshold eliminates them and concentrates the results on matches that are more likely to contain large coding regions and genomic fragments.

If you are still seeing a "timeout" error message after making the above changes, please contact nik@genebee.msu.su with the description of your search.
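
To see why a larger Word Size reduces the work the server has to do, here is a simplified sketch that counts exact word matches between two sequences. It is not the actual BLAST seeding code; the function name and the toy sequences are made up for illustration.

```python
def count_seed_hits(query, subject, word_size):
    """Count position pairs where query and subject share an identical word.

    Simplified exact-match seeding, for illustration only.
    """
    # Index every word of the subject by its start positions.
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)

    # Each query word contributes one hit per matching subject position.
    hits = 0
    for i in range(len(query) - word_size + 1):
        hits += len(index.get(query[i:i + word_size], ()))
    return hits


if __name__ == "__main__":
    query = "ACGTACGTACGTTTGACCA"
    subject = "TTACGTACGTACGAAACGTTTGACC"
    for w in (4, 7, 11):
        print(w, count_seed_hits(query, subject, w))
```

Every match of length W+1 contains a match of length W at the same position, so the number of seeds can only stay the same or fall as the Word Size grows.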

Q: Why do I get the message "ERROR: BLASTSetUpSearch: Unable to calculate Karlin-Altschul params, check query sequence"?

This will happen if your entire query sequence has been masked by low complexity filtering. You will need to turn filtering off to get hits. For further information on filtering, please read the sections of the BLAST FAQs on Q: What is low-complexity sequence? and also Q: After running a search why do I see a string of "X"s (or "N"s) in my query sequence that I did not put there?

Q: Why do I get the message "ERROR: Blast: No valid letters to be indexed"?

You can see this error message if too many ambiguity codes (R,Y,K,W,N, etc. for nucleotides) are present in your query sequence. Although BLAST allows ambiguity codes, be aware that these will always contribute a negative score in nucleic acid searches. Thus, sequences such as degenerate PCR primers with ambiguity codes may not find any significant hits even though they may be designed from sequences that are present in the database.
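
Before submitting a nucleotide query, it can help to check how much of it consists of ambiguity codes. The sketch below is a minimal example; the function name is hypothetical, and no particular cutoff is implied because the FAQ does not state one.

```python
# Letters that identify a single base; everything else is an ambiguity code.
UNAMBIGUOUS = set("ACGTU")


def ambiguity_fraction(sequence):
    """Return the fraction of letters in a nucleotide sequence that are ambiguity codes."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        return 0.0
    ambiguous = sum(1 for c in letters if c not in UNAMBIGUOUS)
    return ambiguous / len(letters)


if __name__ == "__main__":
    primer = "GGNTAYCARTGG"  # made-up degenerate primer
    print(f"{ambiguity_fraction(primer):.2f}")
```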

Q: After running a search why do I see a string of "X"s (or "N"s) in my query sequence that I did not put there?

You are seeing the result of automatic filtering of your query for low-complexity sequence, which is performed to prevent artifactual hits. The filter replaces any low-complexity sequence that it finds with the letter "N" in nucleotide sequences (e.g., "NNNNNNNNNNNNN") or the letter "X" in protein sequences (e.g., "XXXXXXXXX"). Low-complexity regions can result in high scores that reflect compositional bias rather than significant position-by-position alignment (Wootton & Federhen, 1996). Filter programs can eliminate these potentially confounding matches from the BLAST reports, leaving regions whose BLAST statistics reflect the specificity of their pairwise alignment. Queries searched with the blastn program are filtered with DUST. The other BLAST programs use SEG.

You can change the default and remove these filters if you like. On the BLAST Web interface you will see a checkbox to click that will remove the filter.

Q: Is it possible to search for a motif or pattern with BLAST?

You can search with short query sequences using BLAST after changing a few parameters (see "Q: How do I perform a similarity search with a short peptide/nucleotide sequence?" below). You may also be interested in checking out other molecular biology web sites, such as those mentioned in the Other Resources section at the end of this FAQ, for motif searching software.

Q: How do I perform a similarity search with a short peptide/nucleotide sequence?

First, you will probably need to increase the Expect (E) value in your search. A short query is more likely to occur by chance in the database. Therefore, even a perfect match can have low statistical significance and may not be reported. Increasing the E value allows you to look farther down in the hit list and see matches that would normally be discarded because of low statistical significance.

For most searches, an Expect value up to 1000 is enough to see results. However, you can raise the E value farther by typing -e 10000, for example, in the Other Advanced Options Box.

If you still do not get results after increasing the E value, you may want to try decreasing the Word size (W), another parameter that becomes important with a short query. The BLAST algorithm uses "words" to nucleate regions of similarity. The default Word size for a protein sequence is 3 residues and for nucleotide sequences it is 11 bp. A blastn search will not work with a Word size of less than 7. A good rule of thumb is that the query length must be at least twice the Word size. For example, if your query is a protein sequence of 4 residues, then the Word size should be reduced to 2. Please note that the smaller the Word size, the slower your search will be.

You can lower the default word size in the Other Advanced Options box by typing -W some_number (for example, -W 9).
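
The rule of thumb for the Word size can be sketched as a small helper. The defaults (11 and 3) and the blastn minimum of 7 come from the text above; the function name is hypothetical, and the floor of 1 for blastp is only a guard for this sketch, not a documented limit.

```python
DEFAULT_WORD_SIZE = {"blastn": 11, "blastp": 3}
# blastn minimum is from the FAQ; the blastp floor is an assumption of this sketch.
MIN_WORD_SIZE = {"blastn": 7, "blastp": 1}


def suggest_word_size(program, query_length):
    """Pick a Word size so that the query is at least twice as long as the word."""
    default = DEFAULT_WORD_SIZE[program]
    floor = MIN_WORD_SIZE[program]
    candidate = min(default, query_length // 2)
    return max(floor, candidate)


if __name__ == "__main__":
    print("-W", suggest_word_size("blastp", 4))
    print("-W", suggest_word_size("blastn", 18))
```

For a nucleotide query shorter than 14 bases the helper still returns 7, since blastn does not accept a smaller Word size.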

Sometimes a short query does not produce results because it contains low-complexity sequence. Often this type of sequence can be recognized by the human eye because it looks very redundant, for example the protein sequence PADPPPDPPPP or the nucleotide sequence AAATTTAAAAAT. A filter for low complexity sequence is applied by default to BLAST nucleotide and protein searches. If your query has regions of low-complexity sequence, then large portions of your query may be filtered out, essentially making your query shorter than you might have expected. Removing the filter will help in these cases.

Finally, you can change the matrix to optimize for searching with short protein sequences. For information on query length and the matrix see this document.

Q: Can I use BLAST to compare two or more sequences in a multiple sequence alignment?

You can use the BLAST 2.2.8 Sequences service to compare two nucleotide or two protein sequences against each other using the BLAST 2.2.8 algorithm. The BLAST 2.2.8 algorithm performs a Gapped BLAST search between the two sequences allowing for the introduction of gaps (deletions and insertions) in the resulting alignment. At this time only the blastn and blastp programs are available. Using sequences greater than 150 Kb is not recommended.

To compare one sequence against a specific sequence or set of sequences, you can also use a separate multiple sequence alignment program. There are many such software tools available to do this. NCBI has developed a tool, MACAW, which will do multiple sequence alignments on PC or Mac platforms. The latest version of MACAW is available on the NCBI anonymous FTP site (ftp://ncbi.nlm.nih.gov) under /pub/macaw/. The instructions are included with the program. You may also be interested in checking out other molecular biology web sites, such as those mentioned in the Other Resources section at the end of this FAQ.

Q: What is low-complexity sequence?

Regions with low-complexity sequence have an unusual composition, and this can create problems in sequence similarity searching (Wootton & Federhen, 1996). Low-complexity sequence can often be recognized by visual inspection. For example, the protein sequence PPCDPPPPPKDKKKKDDGPP has low complexity and so does the nucleotide sequence AAATAAAAAAAATAAAAAAT. Filters are used to remove low-complexity sequence because it can cause artifactual hits (please also see Q: After running a search why do I see a string of "X"s (or "N"s) in my query sequence that I did not put there?).

In BLAST searches performed without a filter, often certain hits will be reported with high scores only because of the presence of a low-complexity region. Most often, this type of match cannot be thought of as the result of homology shared by the sequences. Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences that are not truly related.
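
One rough way to put a number on "low complexity" is the Shannon entropy of the residue composition. The sketch below is only an illustration of that idea; it is not the DUST or SEG algorithm, and the function name and the comparison sequence are made up.

```python
import math
from collections import Counter


def shannon_entropy(sequence):
    """Return the Shannon entropy (bits per residue) of a sequence's composition."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    low = "AAATAAAAAAAATAAAAAAT"   # low-complexity example from the text
    mixed = "ACGTTGCAGTCAGGCTAACG"  # made-up sequence with a balanced composition
    print(f"{shannon_entropy(low):.2f}")
    print(f"{shannon_entropy(mixed):.2f}")
```

A sequence dominated by one or two residues scores close to 0, while a nucleotide sequence that uses all four bases evenly approaches 2 bits per residue.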

Q: What is the Expect (E) value?

The Expect value (E) is a parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. It decreases exponentially with the Score (S) that is assigned to a match between two sequences. Essentially, the E value describes the random background noise that exists for matches between sequences.

The Expect value is used as a convenient way to create a significance threshold for reporting results. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported.

In BLAST 2.2.8, the Expect value is also used instead of the P value (probability) to report the significance of matches. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance.
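
The exponential relationship between E and S is usually written E = K * m * n * exp(-lambda * S) in the Karlin-Altschul statistics, where m and n are the query and database lengths. The FAQ itself does not give this formula, so treat the sketch below as background; the lambda and K values in the example are placeholders, not the parameters of this server.

```python
import math


def expect_value(raw_score, query_length, db_length, lam, k):
    """Karlin-Altschul estimate of chance hits scoring at least raw_score."""
    return k * query_length * db_length * math.exp(-lam * raw_score)


if __name__ == "__main__":
    # Placeholder parameters chosen for illustration only.
    lam, k = 0.318, 0.134
    for score in (40, 50, 60):
        e = expect_value(score, query_length=300, db_length=1_000_000_000, lam=lam, k=k)
        print(score, f"{e:.3g}")
```

Raising the score by a fixed amount divides E by a fixed factor, which is why a few extra points of score can move a hit from noise to significance.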

Q: Which BLAST program should I use?

You have many choices to make between different BLAST programs and how to access them. Please see the Overview for more information on this topic. The easiest way to search is to use the BLAST Web pages. There are many additional parameters that can be controlled, but for a basic search, the default options work well.

Q: How can I see low-similarity matches when there are many strong hits to my query sequence?

Often, when the query is a member of a large sequence family, the summary hit list and the alignments returned contain only very high scoring hits. To look at low-similarity matches, you must increase the maximum number of results returned. It is often sufficient to increase the size of the summary hit list and the number of alignments shown using the menus on the BLAST Web pages. However, it is possible to increase the lists even further using the Other Advanced Options box on the Web BLAST pages. For BLAST 2.2.8, "-v 2000", for example, will increase the number of descriptions returned in the summary hit list to 2000. The option "-b 2000" will similarly increase the number of alignments returned.
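
If you assemble the text for the Other Advanced Options box in a script, a helper like the one below keeps the flags consistent. The function and its parameter names are hypothetical; only the -e, -W, -v and -b flags themselves come from this FAQ.

```python
def advanced_options(expect=None, word_size=None, descriptions=None, alignments=None):
    """Build the string to type into the Other Advanced Options box."""
    parts = []
    if expect is not None:
        parts.append(f"-e {expect}")        # Expect threshold
    if word_size is not None:
        parts.append(f"-W {word_size}")     # Word size
    if descriptions is not None:
        parts.append(f"-v {descriptions}")  # descriptions in the summary hit list
    if alignments is not None:
        parts.append(f"-b {alignments}")    # alignments returned
    return " ".join(parts)


if __name__ == "__main__":
    print(advanced_options(descriptions=2000, alignments=2000))
```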

Q: How can I limit my BLAST search based on Organism?

The option to limit a search to an organism, or even to a taxonomic classification, is now available in BLAST 2.2.8. There is an edit field for entering the species name or classification (example: "eubacteria").

Q: How can I search a batch of sequences with BLAST?

There are basically two ways to run Batch BLAST searches.

Install the BLAST 2.0 Server Locally:
You can install the Standalone BLAST 2.0 server on your own machine if you have a Windows or a UNIX platform. Installing this executable allows you to search local databases as well as public ones that you have downloaded and installed. Standalone BLAST 2.2.8 also allows you to do gapped BLAST and PSI-BLAST searches. The Standalone BLAST 2.2.8 server and its new capabilities are described in Altschul et al., 1997. There is information about Standalone BLAST in the "Overview" available from the sidebar of the main BLAST page and also here.
There is also some information on setting up the programs at the NHGRI site at:
http://genome.nhgri.nih.gov/blastall/blast_install/
The Standalone executables are available at the anonymous FTP location.
Install the BLAST 2.0 Network client software locally:
The BLAST 2.0 Network client will allow you to submit a file of FASTA sequences over an internet connection to the NCBI BLAST databases. The BLAST Network client executables are located here. There are executables for Mac, PC, and various UNIX platforms.
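
Both approaches start from a file of FASTA sequences. As a minimal sketch of what a batch script has to do first, the reader below splits such a file into individual records. The function name is hypothetical and the parser only handles plain FASTA with ">" header lines.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    example = [">seq1 first query", "ACGTAC", "GTTA", "", ">seq2", "MKVLA"]
    for name, seq in read_fasta(example):
        print(name, len(seq))
```

Each record can then be submitted as a separate query, or the records can be counted and checked before the whole file is sent.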

Other Resources:

The on-line BLAST Course was written by Dr. Stephen Altschul and discusses the basics of the Gapped BLAST algorithm. In addition, the full text of the 1997 Nucleic Acids Research paper "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs" is available on-line.

Chapter 7 of Cold Spring Harbor Genome Analysis Laboratory Manual also provides helpful introductory information for users of molecular biology databases and software. This chapter is available over the WWW or from the Cold Spring Harbor Laboratory WWW home page under CSHL Press.

There are many sites which offer software tools for molecular biologists and for manipulating sequence data. Some of the larger of these are listed below:

European Bioinformatics Institute (EBI) BioCatalog:
http://www.ebi.ac.uk/biocat
Indiana University IUBio Archive:
http://iubio.bio.indiana.edu
Pedro's BioMolecular Research Tools:
http://www.public.iastate.edu/~pedro/research_tools.html