This document was taken from a search engine cache.
Original document address: http://www.genebee.msu.ru/blast_new/bh_faq.html
Last modified: Thu Feb 26 23:05:50 2004. Indexed: Tue Oct 2 01:14:08 2012.
Short Sequences: Depending on sequence composition, a short sequence is a sequence under 20 residues.
Filtering: BLAST filters regions of low complexity (for a description of low complexity see What is low-complexity sequence? below). If your sequence contains large regions of "low complexity", it may not produce significant hits to the database. You can turn off filtering by unchecking all flags in the "Filter" options section.
Query Format: Another reason you may see the "No Significant Similarity found" message is using the wrong type of sequence in your search.
However, there are some things you can do to prevent a timeout and generate results from large sequences.
You can change the default and remove these filters if you like. On the BLAST Web interface you will
see a checkbox to click that will remove the filter.
Q: Is it possible to search for a motif or pattern with BLAST?
You can search with short query sequences using BLAST after changing a few parameters (see
"Q: How do I perform a similarity search with a short peptide/nucleotide sequence?" in this FAQ). You may
also be interested in checking out other molecular biology web sites, such as those mentioned in the
Other Resources section at the end of this FAQ, for motif searching software.
Q: How do I perform a similarity search with a short peptide/nucleotide sequence?
First, you will probably need to increase the Expect (E) value in your search. A short query is more
likely to occur by chance in the database. Therefore, even a perfect match can have low statistical
significance and may not be reported. Increasing the E value allows you to look farther down in the
hit list and see matches that would normally be discarded because of low statistical significance.
For most searches, an Expect value up to 1000 is enough to see results. However, you can raise the E value further on the BLAST Web page by typing -e 10000, for example, in the Other Advanced Options box.
If you still do not get results after increasing the E value, you may want to try decreasing the Word size (W), another parameter that becomes important with a short query. The BLAST algorithm uses "words" to nucleate regions of similarity. The default Word size for a protein sequence is 3 residues and for nucleotide sequences it is 11 bp. A blastn search will not work with a Word size of less than 7. A good rule of thumb is that the query length must be at least twice the Word size. For example, if your query is a protein sequence of 4 residues, then the Word size should be reduced to 2. Please note that the smaller the Word size, the slower your search will be.
You can lower the default Word size on the BLAST Web page: in the Other Advanced Options box, type -W some_number (for example, -W 9).
Sometimes a short query does not produce results because it contains low-complexity sequence. Often this type of sequence can be recognized by the human eye because it looks very redundant, for example the protein sequence PADPPPDPPPP or the nucleotide sequence AAATTTAAAAAT. A filter for low complexity sequence is applied by default to BLAST nucleotide and protein searches. If your query has regions of low-complexity sequence, then large portions of your query may be filtered out, essentially making your query shorter than you might have expected. Removing the filter will help in these cases.
Finally, you can change the matrix to optimize for searching with short protein sequences. For information
on query length and the matrix see
this document.
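The adjustments described in this answer can be sketched as a small helper that assembles an options string. This is only an illustration: the function name `short_query_options`, its defaults, and the protein minimum Word size of 2 are assumptions made for the example, and `-F F` is the legacy command-line way to switch filtering off rather than something this FAQ spells out. The `-e` and `-W` options and the numeric rules are the ones quoted above.

```python
def short_query_options(query, program="blastp", expect=1000):
    """Suggest Other Advanced Options for a short query (illustrative only)."""
    default_word = {"blastp": 3, "blastn": 11}  # defaults quoted in the FAQ
    minimum_word = {"blastp": 2, "blastn": 7}   # blastn fails below 7; 2 for protein is assumed

    # Rule of thumb: the query length should be at least twice the Word size.
    word = min(default_word[program], len(query) // 2)
    word = max(word, minimum_word[program])

    # -e raises the Expect threshold, -W sets the Word size,
    # -F F turns the low-complexity filter off.
    return ["-e", str(expect), "-W", str(word), "-F", "F"]


if __name__ == "__main__":
    print(" ".join(short_query_options("ACDE")))  # prints: -e 1000 -W 2 -F F
```

For a 12-base nucleotide query the same helper would suggest `-W 7`, because the rule of thumb gives 6 and blastn does not accept anything below 7.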
Q: Can I use BLAST to compare two or more sequences in a multiple sequence alignment?
You can use the BLAST 2.2.8 Sequences service
to compare two nucleotide or two protein sequences against each other using the BLAST 2.2.8 algorithm.
The BLAST 2.2.8 algorithm performs a Gapped BLAST search between the two sequences allowing for the
introduction of gaps (deletions and insertions) in the resulting alignment. At this time only the
blastn and blastp programs are available. Using sequences greater than 150 Kb is not recommended.
To compare one sequence against a specific sequence or set of sequences, you can also use a separate
multiple sequence alignment program. There are many such software tools available to do this. NCBI has
developed a tool, MACAW, which will do multiple sequence alignments on PC or Mac platforms. The
latest version of MACAW is available on the NCBI
anonymous FTP site (ftp://ncbi.nlm.nih.gov) under /pub/macaw/. The instructions are included with the
program. You may also be interested in checking out other molecular biology web sites, such as those
mentioned in the Other Resources section at the end of this FAQ.
Q: What is low-complexity sequence?
Regions with low-complexity sequence have an unusual composition and this can create problems in
sequence similarity searching (Wootton & Federhen, 1996). Low-complexity sequence can often be
recognized by visual inspection.
For example, the protein sequence PPCDPPPPPKDKKKKDDGPP has low complexity and so does the nucleotide
sequence AAATAAAAAAAATAAAAAAT. Filters are used to remove low-complexity sequence because it can cause
artifactual hits (please also see Q: After running a search why do I see a string of "X"s
(or "N"s) in my query sequence that I did not put there?).
In BLAST searches performed without a filter, often certain hits will be reported with high scores only
because of the presence of a low-complexity region. Most often, this type of match cannot be thought of
as the result of homology shared by the sequences. Rather, it is as if the low-complexity region is
"sticky" and is pulling out many sequences that are not truly related.
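One rough way to put a number on "unusual composition" is the Shannon entropy of the residue counts. The sketch below is an assumption made for illustration only: it is not the filter BLAST applies, and the function name `shannon_entropy` is hypothetical. It simply shows that the example sequences above score far lower than a sequence that uses every residue once.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy of the residue composition, in bits per residue."""
    counts = Counter(seq)
    total = len(seq)
    # Sum -p * log2(p) over the observed residues; a one-letter
    # sequence scores 0, a perfectly even mix scores log2(alphabet size).
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    low = shannon_entropy("PPCDPPPPPKDKKKKDDGPP")    # FAQ example, low complexity
    varied = shannon_entropy("ACDEFGHIKLMNPQRSTVWY")  # each amino acid once
    print(f"{low:.2f} bits vs {varied:.2f} bits")
```

Real filters work on sliding windows rather than on the whole sequence, so a whole-sequence score like this only indicates the general idea.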
Q: What is the Expect (E) value?
The Expect value (E) is a parameter that describes the number of hits one can "expect" to see just by
chance when searching a database of a particular size. It decreases exponentially with the Score (S)
that is assigned to a match between two sequences. Essentially, the E value describes the random
background noise that exists for matches between sequences.
The Expect value is used as a convenient way to create a significance threshold for reporting results. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported.
In BLAST 2.2.8, the Expect value is also used instead of the P value (probability) to report the significance
of matches. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a
database of the current size one might expect to see 1 match with a similar score simply by chance.
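The exponential dependence on the score is usually written as the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where m is the query length and n is the total database length. The sketch below evaluates that formula. The function name `expect_value` is hypothetical, and the default K and lambda are placeholder values chosen for illustration; the real values depend on the scoring matrix and gap penalties and are printed in the statistics section of a BLAST report.

```python
import math


def expect_value(score, query_len, db_len, k=0.041, lam=0.267):
    """E = K * m * n * exp(-lambda * S); k and lam are illustrative placeholders."""
    return k * query_len * db_len * math.exp(-lam * score)


if __name__ == "__main__":
    # Raising the raw score lowers E exponentially;
    # enlarging the database raises E in proportion.
    for s in (40, 50, 60):
        print(s, expect_value(s, query_len=100, db_len=1_000_000))
```

This also shows why a short query struggles to reach significance: its maximum attainable score is small, so E stays large even for a perfect match.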
Q: Which BLAST program should I use?
You have many choices to make between different BLAST programs and how to access them. Please see the
Overview for more information on this topic. The easiest way to search
is to use the BLAST Web pages. There are many additional parameters that can be controlled, but for a
basic search, the default options work well.
Q: How can I see low-similarity matches when there are many strong hits to my query sequence?
Often, when the query is a member of a large sequence family, the summary hit list and the alignments
returned only contain very high scoring hits. To look at low-similarity matches, you must increase the
maximum number of results returned. On the BLAST Web pages, it is often sufficient to increase the
size of the summary hit list and the number of alignments shown using the menus provided there.
However, it is possible to increase the lists even further using the
Other Advanced Options box on the Web BLAST pages. For BLAST 2.2.8, "-v 2000", for example, will
increase the number of descriptions returned in the summary hit list to 2000. The option "-b 2000"
will similarly increase the number of alignments returned.
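For Standalone BLAST, the same two options can be passed on the command line. The sketch below only builds the command and does not run it. The helper name `long_hit_list_command`, the database name `nr`, and the file name `query.fasta` are placeholders chosen for the example; `-p`, `-d`, and `-i` are assumed here to be the program, database, and input options of the legacy `blastall` executable.

```python
import shlex


def long_hit_list_command(query_file, database, program="blastp",
                          descriptions=2000, alignments=2000):
    """Build (but do not run) a blastall command with longer result lists."""
    return [
        "blastall",
        "-p", program,            # which BLAST program to run
        "-d", database,           # database to search (placeholder name)
        "-i", query_file,         # FASTA query file (placeholder name)
        "-v", str(descriptions),  # one-line descriptions in the summary hit list
        "-b", str(alignments),    # alignments shown
    ]


if __name__ == "__main__":
    print(shlex.join(long_hit_list_command("query.fasta", "nr")))
```

On the Web pages only the `-v 2000 -b 2000` part would go into the Other Advanced Options box.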
Q: How can I limit my BLAST search based on Organism?
The option to limit a search to organism and even taxonomic classification is now available in BLAST 2.2.8.
There is an editor field to input the species name or classification (example: "eubacteria").
Q: How can I search a batch of sequences with BLAST?
There are basically two ways to run Batch BLAST searches.
You can install the Standalone BLAST 2.0 server on your own machine if you have a Windows or a UNIX platform. Installing this executable allows you to search local databases as well as public ones that you have downloaded and installed. Standalone BLAST 2.2.8 also allows you to do gapped BLAST and PSI-BLAST searches. The Standalone BLAST 2.2.8 server and its new capabilities are described in Altschul et al., 1997. There is information about Standalone BLAST in the "Overview" available from the sidebar of the main BLAST page and also here.
There is also some information on setting up the programs at the NHGRI site at:
http://genome.nhgri.nih.gov/blastall/blast_install/
The Standalone executables are available at the anonymous FTP location.
The BLAST 2.0 Network client will allow you to submit a file of FASTA sequences over an internet connection to the NCBI BLAST databases. The BLAST Network client executables are located here. There are executables for Mac, PC, and various UNIX platforms.
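Both approaches take their queries as a file of FASTA sequences. As a minimal sketch of preparing such a file, the helper below formats a batch of named sequences as multi-FASTA text. The function name `to_fasta`, the record names, and the 60-character line width are assumptions made for the example, not requirements stated in this FAQ.

```python
def to_fasta(records, width=60):
    """Format (name, sequence) pairs as multi-FASTA text."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")            # definition line starts with ">"
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])  # wrap the sequence at `width` characters
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    batch = [("query1", "PADPPPDPPPP"), ("query2", "AAATTTAAAAAT")]
    with open("batch.fasta", "w") as handle:  # hypothetical output file name
        handle.write(to_fasta(batch))
```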
Chapter 7 of Cold Spring Harbor Genome Analysis Laboratory Manual also provides helpful introductory information for users of molecular biology databases and software. This chapter is available over the WWW or from the Cold Spring Harbor Laboratory WWW home page under CSHL Press.
There are many sites which offer software tools for molecular biologists and for manipulating sequence data. Some of the larger of these are listed below: