Please cite as: CSH Protocols; 2007; doi:10.1101/pdb.top17
Topic Introduction
Adapted from "Sequence Database Searching for Similar Sequences," Chapter 6, in Bioinformatics: Sequence and Genome Analysis, 2nd edition, by David W. Mount. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, USA, 2004.
INTRODUCTION
The BLAST algorithm was developed to perform DNA and protein sequence similarity searches faster than FASTA while remaining comparably sensitive. Both methods are heuristic: they follow a tried-and-true procedure that almost always finds related sequences in a database search but, unlike the dynamic programming algorithm, they do not guarantee an optimal solution. FASTA finds short common patterns in query and database sequences and joins these into an alignment. BLAST is similar to FASTA, but gains a further increase in speed by searching only for rarer, more significant patterns in nucleic acid and protein sequences. BLAST is very popular because it is available on the World Wide Web through a large server at the National Center for Biotechnology Information (NCBI) and at many other sites. The BLAST algorithm has evolved to provide molecular biologists with a set of very powerful search tools that are freely available to run on many computer platforms. This article is intended to be a "user's guide" to the principles underlying BLAST.
RELATED INFORMATION
Articles that describe Strategies for Sequence Similarity Database Searches, Using a FASTA Sequence Database Similarity Search, Recommended Steps for a FASTA Search, and Steps Used by the BLAST Algorithm are also available.
OVERVIEW
Like FASTA, the BLAST algorithm increases the speed of sequence alignment by searching first for common words or k-tuples in the query sequence and each database sequence. Whereas FASTA searches for all possible words of the same length, BLAST confines the search to the words that are the most significant. For proteins, significance is determined by evaluating these word matches using log odds scores in the BLOSUM62 amino acid substitution matrix. For the BLAST algorithm, the word length is fixed at 3 (formerly 4) for proteins and 11 for nucleic acids (three if the sequences are translated in all six reading frames). These lengths are the minimum needed to achieve a word score that is high enough to be significant but not so long as to miss short but significant patterns. FASTA theoretically provides a more sensitive search of DNA sequence databases because a shorter word length may be used.
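As a rough illustration of the first step described above, the lookup of words shared by a query and a database sequence can be sketched in Python. This is a minimal sketch that finds exact word matches only; it omits the BLOSUM62 scoring that BLAST uses to keep only the most significant words, and the function names `index_words` and `shared_words` are illustrative and not part of any BLAST program.

```python
from collections import defaultdict


def index_words(seq, k):
    """Map each k-letter word in seq to the list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def shared_words(query, subject, k):
    """Return (query_pos, subject_pos, word) for every exact word shared by both sequences."""
    table = index_words(query, k)
    hits = []
    for j in range(len(subject) - k + 1):
        word = subject[j:j + k]
        # .get() avoids adding empty entries to the defaultdict
        for i in table.get(word, []):
            hits.append((i, j, word))
    return hits


# Word length 11 for nucleic acids and 3 for proteins, as stated in the text.
dna_hits = shared_words("TTGACCTAGGCATCGA", "GGGACCTAGGCATCCC", 11)
protein_hits = shared_words("MKVLA", "AKVLG", 3)
```

Indexing the query once and scanning each database sequence against that table is what makes the word search fast: each lookup is a dictionary access, not a full comparison of the two sequences.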
The BLAST algorithm has gone through several developmental stages. The most recent gapped BLAST, or BLAST2, is recommended, in part because older versions of BLAST are slower and reported to overestimate the significance of database matches (Brenner et al. 1998). The most important recent change is that BLAST reports the significance of a gapped alignment of the query and database sequences. Former versions reported several ungapped alignments, and it was more difficult to evaluate their overall significance.
SEQUENCE FILTERING
Low-complexity regions contain fewer distinct sequence characters than other regions because the same character or pattern is repeated. These sequences produce artificially high-scoring alignments that do not accurately convey sequence relationships in sequence similarity searches. Regions of low complexity or repetitive sequences may be readily visualized in a dot matrix analysis of a sequence against itself. Low-complexity regions with a repeat occurrence of the same residue can appear on the matrix as horizontal and vertical rows of dots representing repeated matches of one residue position in one copy of the sequence against a series of the same residue in the second copy. Repeats of a sequence pattern appear in the same matrix as short diagonals of identity that are offset from the main diagonal. Such sequences should be excluded from sequence similarity searches.
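The self-comparison described above can be approximated without drawing the matrix. The following Python sketch counts, for each diagonal offset from the main diagonal, how many positions of a sequence match the same sequence shifted by that offset; a high count at offset d suggests a repeat of period d. The function name `offset_match_counts` is hypothetical, and a real dot matrix program would normally also apply a window and a stringency filter.

```python
def offset_match_counts(seq, max_offset):
    """For each offset d (1..max_offset), count positions i where seq[i] == seq[i + d].

    Each count is the number of dots on the diagonal that lies d positions
    away from the main diagonal in a dot matrix of seq against itself.
    """
    counts = {}
    for d in range(1, max_offset + 1):
        counts[d] = sum(1 for i in range(len(seq) - d) if seq[i] == seq[i + d])
    return counts


# A dinucleotide repeat fills the diagonal at offset 2 and leaves offsets 1 and 3 empty.
repeat_counts = offset_match_counts("ACACACAC", 3)
# A run of one residue matches at every offset.
run_counts = offset_match_counts("AAAA", 2)
```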
The BLAST programs include a feature for filtering the query sequence through programs that search for low-complexity regions. Filtering is applied only to the query sequence and not to the database sequences. Low-complexity regions are marked with an X (protein sequences) or N (nucleic acid sequences) and are then ignored by the BLAST program. Removing low-complexity and repeat sequences increases emphasis on the more significant database hits. The NCBI programs SEG and PSEG are used to mask amino acid sequences, and NSEG is used to mask nucleic acid sequences (Wootton and Federhen 1993, 1996). The SEG programs are available by anonymous FTP from ftp://ftp.ncbi.nih.gov/pub/seg/, including documentation. The program DUST is also used for DNA sequences (see Filter under BLAST Search Parameters at http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml). RepeatMasker (described later) is another program for this same purpose.
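The masking step itself is simple once the low-complexity regions are known. The Python sketch below assumes the regions have already been found by some other program and are supplied as half-open (start, end) index pairs; the function name `mask_regions` and this region format are assumptions made for illustration, not the output format of SEG or DUST.

```python
def mask_regions(seq, regions, is_protein):
    """Replace each half-open (start, end) region with X (protein) or N (nucleic acid)."""
    mask_char = "X" if is_protein else "N"
    chars = list(seq)
    for start, end in regions:
        # Clamp the region to the sequence so out-of-range coordinates are harmless
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


# Mask a run of eight A residues in a DNA query and a poly-Q stretch in a protein query.
masked_dna = mask_regions("ACGT" + "A" * 8 + "CGT", [(4, 12)], is_protein=False)
masked_protein = mask_regions("MKV" + "Q" * 6 + "LA", [(3, 9)], is_protein=True)
```

Because the masked characters stay in place, the coordinates of the remaining residues are unchanged, so alignment positions reported for the masked query still refer to the original sequence.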
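The compositional complexity in a window of sequence of length L is given by (Wootton and Federhen 1996):
$$K = \frac{1}{L}\,\log_N\!\left(\frac{L!}{\prod_{i=1}^{N} n_i!}\right) \qquad (1)$$
where N is 4 for nucleic acid sequences and 20 for protein sequences, and $n_i$ are the numbers of each residue in the window. K will vary from 0 for very low complexity to 1 for high complexity. Thus, complexity is given by the following examples: for the window GGGG, L! = 24 and the product of the $n_i!$ is also 24, so K = (1/4) log_4(24/24) = 0; for the window CTGA, every $n_i$ = 1, so K = (1/4) log_4(24/1) ≈ 0.573.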
Compositional complexities are sometimes calculated to produce K scores in bit units of logarithms to the base 2. A sliding window (usually 12 residues) is moved along the sequence, and the complexity is calculated at each position. Regions of low complexity are identified using Equation 1, neighboring low-complexity regions are then joined into longer regions, and the resulting region is then reduced to a single optimal segment by a minimization procedure. The SEG program is used for analysis of either proteins or nucleic acids by the above methods. PSEG and NSEG are similar to SEG but are set up for analysis of protein and nucleic acid sequences, respectively. These versatile programs may also be used for locating specific sequence patterns that are characteristic of exons or protein structural domains. In database searches involving comparisons of genomic DNA sequences with EST sequence libraries, use of repeat masking is important for filtering output to the most significant matches because of the presence of a variety of repetitive sequences ranging from mononucleotide repeats to larger repeated elements in genomes (Claverie 1996).
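The sliding-window calculation described above can be sketched in Python. The sketch assumes the Wootton and Federhen measure has the form K = (1/L) log_N(L! / (n_1! n_2! ... n_N!)), with N = 4 for nucleic acids and 20 for proteins; the function names are illustrative, and the later steps used by SEG (joining neighboring regions and reducing them to a single optimal segment) are not attempted here.

```python
from collections import Counter
from math import lgamma, log


def complexity(window, alphabet_size):
    """Compositional complexity K of one window, under the formula assumed above."""
    length = len(window)
    # lgamma(n + 1) equals ln(n!), which avoids very large factorials
    ln_ratio = lgamma(length + 1) - sum(
        lgamma(n + 1) for n in Counter(window).values()
    )
    # Divide by ln(N) to convert the natural log to a base-N log, then by L
    return ln_ratio / (length * log(alphabet_size))


def sliding_complexity(seq, alphabet_size, window=12):
    """Return K for every window of the given length along the sequence."""
    return [
        complexity(seq[i:i + window], alphabet_size)
        for i in range(len(seq) - window + 1)
    ]


k_low = complexity("GGGG", 4)    # a single repeated residue
k_high = complexity("CTGA", 4)   # every residue different
profile = sliding_complexity("A" * 12 + "ACGT", 4, window=12)
```

Windows whose K falls below a chosen cutoff would be the candidates for masking; the cutoff is a parameter of the filtering program and is not specified here.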
In addition to low-complexity regions, BLAST will also filter out repeat elements (such as human SINE and LINE retroposons). Another filtering program for repeats of periodicity <10 residues called XNU (Claverie and States 1993) is used by the BLAST stand-alone programs, but is not available on the NCBI server.
Another important Web server, RepeatMasker (http://ftp.genome.washington.edu/), screens sequences for interspersed repeats known to be present in mammalian genomes and also can filter out low-complexity regions (A.F.A. Smit and P. Green, see Web site above). A dynamic programming search program, cross-match (P. Green, see Web site), performs a search of a repeat database with the query sequence (Claverie 1996). A database of repetitive elements (Repbase) maintained at http://www.girinst.org by the Genetics Information Research Institute (Jurka 1998) can also be used for this purpose.
OTHER BLAST PROGRAMS AND OPTIONS
The extensive set of BLAST resources, including programs and sequence databases, is fully described on the NCBI site map at http://www.ncbi.nlm.nih.gov/Sitemap/index.html. There are a number of variations of the BLAST program for comparing either nucleic acid or protein query sequences with nucleic acid or protein sequence databases. If necessary, the programs translate nucleic acid sequences in all six possible reading frames to compare them to protein sequences.
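Translation in all six reading frames can be sketched as follows in Python. The sketch assumes the standard genetic code and unambiguous A, C, G, T input; the frame labels (+1 to +3 for the given strand, -1 to -3 for the reverse complement) and the use of X for any codon not in the table are conventions chosen for this example, not taken from the BLAST programs.

```python
from itertools import product

# Standard genetic code in the compact TCAG ordering (third base varies fastest)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; a trailing partial codon is dropped."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return a dict mapping frame label to the protein translated in that frame."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frames("ATGGCCTAA")
```

Each of the six translated sequences can then be compared with protein sequences in the same way as an ordinary protein query.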
These BLAST programs are shown in Table 1 along with the types of alignment, gapped or ungapped, that they produce. Table 2 lists the databases available, and Table 3 lists the options and parameter settings available on the BLAST server. These various options are also described on the main BLAST Web page at http://www.ncbi.nlm.nih.gov/BLAST.
OTHER BLAST-RELATED PROGRAMS
BLAST-Enhanced Alignment Utility (BEAUTY)
BEAUTY adds information to BLAST search results, including figures that summarize the locations of high-scoring segment pairs (HSPs) and of any known protein domains and sites, such as PFAM domains and Prosite patterns, that are present in the matching database sequences (Worley et al. 1995). To make this enhanced type of analysis possible, a new database of sequence domains and sites was created for use with the BEAUTY program; for each sequence in Entrez, it shows the possible location of patterns in the Prosite catalog, the BLOCKS database, and the PRINTS protein fingerprint database. The BEAUTY program is accessible on the BCM Search Launcher (http://searchlauncher.bcm.tmc.edu/).
BLAST Searching with a Cobbler Sequence
The BLOCKS server (http://blocks.fhcrc.org) offers a variety of BLAST searches that use as a query sequence a consensus sequence derived from multiple sequence alignment of a set of related proteins. This consensus sequence, called a Cobbler sequence (Henikoff and Henikoff 1997), is used to focus the search on residues that are in the majority in each column of the multiple sequence alignment, rather than on any one particular sequence. Hence, the search may detect additional database sequences with variation unlike that found in the original sequences, yet still representing the same protein family.
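The idea of focusing on the majority residue in each alignment column can be illustrated with a simple majority-rule consensus in Python. This is a simplified sketch of that one idea and not the Cobbler procedure itself; the function name `majority_consensus`, the use of "-" as the gap character, and the rule of ignoring gaps when counting are all assumptions made for the example.

```python
from collections import Counter


def majority_consensus(alignment, gap="-"):
    """Return the most common non-gap residue in each column of an alignment.

    alignment is a list of equal-length aligned sequences. A column made up
    entirely of gaps yields a gap in the consensus. An empty alignment or
    rows of unequal length raise ValueError.
    """
    lengths = {len(row) for row in alignment}
    if len(lengths) != 1:
        raise ValueError("alignment must contain sequences of one common length")
    consensus = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != gap)
        consensus.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(consensus)


consensus = majority_consensus(["MK-LA", "MKVLA", "MRVLG"])
```

The resulting string could then be submitted as a query in place of any single member of the family.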
BLAST2
This program uses the BLASTP or BLASTN algorithms for aligning two sequences and may be reached on the BLAST server site at NCBI. This program is useful for aligning very long sequences, but sequences >150 kb are not recommended.
REFERENCES
Brenner, S.E., Chothia, C., and Hubbard, T.J. 1998. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl. Acad. Sci. 95: 6073–6078.
Claverie, J.-M. 1996. Effective large-scale sequence similarity searches. Methods Enzymol. 266: 212–227.
Claverie, J.-M. and Makalowski, W. 1994. Alu alert. Nature 371: 752.
Claverie, J.-M. and States, D.J. 1993. Information enhancement methods for large scale sequence analysis. Comput. Chem. 17: 191–201.
Henikoff, S. and Henikoff, J.G. 1997. Embedding strategies for effective use of information from multiple sequence alignments. Protein Sci. 6: 698–705.
Jurka, J. 1998. Repeats in genomic DNA, mining and meaning. Curr. Opin. Struct. Biol. 8: 333–337.
Wootton, J.C. and Federhen, S. 1993. Statistics of local complexity in amino acid sequences and sequence databases. Comput. Chem. 17: 149–163.
Wootton, J.C. and Federhen, S. 1996. Analysis of compositionally biased regions in sequences. Methods Enzymol. 266: 554–571.
Worley, K.C., Wiese, B.A., and Smith, R.F. 1995. BEAUTY: An enhanced BLAST-based search tool that integrates multiple biological information resources into sequence similarity search results. Genome Res. 5: 173–184.
Copyright © 2007 by Cold Spring Harbor Laboratory Press. Online ISSN: 1559-6095