BLAST Lookup Table Phase
The initial stage of every pairwise alignment in BLAST, independent of the precise BLAST program run, involves a hashtable-based matching of small regions of the query sequence to small regions of the database sequence. The following is a guided tour of this process, along with a description of the optimizations used to make it fast.
The big reason that dynamic programming alignment of sequences can rigorously find all high-scoring alignments is that every letter of the query sequence is tested against every letter of a database sequence. This is also responsible for the number of operations being proportional to the product of the two sequence lengths.
BLAST achieves lower asymptotic complexity on average because it makes a simplifying assumption: any alignment of interest between query and database sequence will contain at least one region of W consecutive letters that is high-scoring. Making the value of W too small increases the work, and making W too large dramatically reduces the work but also causes most alignments to be missed. Once W is fixed, the query sequence can be preprocessed so that the time needed to find these tiny alignments ('seeds') is reduced when scanning a large number of database sequences.
For example, consider the ungapped alignment:
```
Query: 6   ALWDYEPQNDDELPMKEGDCMTI   28
           AL+DY+ + ++++ + GD +T+
Sbjct: 6   ALYDYKKEREEDIDLHLGDILTV   28
```
With W=3, the alignment contains several triplets of amino acids whose alignment achieves a high score. Here are the high-scoring 3-letter sub-alignments ('words' in the source code, i.e. groups of letters):
offset | query | subject | score |
---|---|---|---|
6 | ALW | ALY | 10 |
7 | LWD | LYD | 12 |
8 | WDY | YDY | 15 |
9 | DYE | DYK | 14 |
10 | YEP | YKK | 8 |
22 | EGD | LGD | 9 |
23 | GDC | GDI | 11 |
26 | MTI | LTV | 10 |
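Each score in the table is just the sum of the per-letter substitution scores for the word pair: WDY aligned with YDY, for instance, scores 2 + 6 + 7 = 15 under BLOSUM62. A minimal sketch of that computation (the function name and encoded-letter indexing here are illustrative, not the engine's actual code):

```c
/* Sum the substitution scores for a W-letter word pair. A sketch:
   'matrix' is assumed to be a score matrix (e.g. BLOSUM62) indexed
   by letters encoded as integers in [0, 27]. */
static int word_score(const unsigned char *q, const unsigned char *s,
                      int W, const int matrix[28][28])
{
    int i, score = 0;
    for (i = 0; i < W; i++)
        score += matrix[q[i]][s[i]];   /* per-letter substitution score */
    return score;
}
```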
Given the capability to pick out these (and other high-scoring) words very quickly from the original sequences, and the ability to notice that out of all the words found the above group of words clusters together, it should be possible to construct the above alignment without having to test every combination of letters between the two sequences. This is exactly what BLAST does.
Notice that none of the words that achieve a high score in the alignment are completely identical. All of the aligned words are 'neighbors', i.e. contain some exact matches and some matches to which BLOSUM62 assigns a good score. Neighboring words are a good idea if the alphabet is large, but for nucleotide sequences the alphabet is only four letters. In this case the words involved must be longer, since alignments of letters drawn from a small alphabet will contain chance strings of consecutive matches more often. Also, it's customary to only consider matches or mismatches for nucleotide words. The table-driven methods described below work equally well for finding neighboring words or for finding exact matches between words.
For nucleotide BLAST, the lookup table only deals with exact matches. Even if later stages actually use a score matrix when aligning nucleotides, the initial phase that involves matching up query words with database words only considers words to match if all letters are identical.
Given the query to a BLAST search, the first step is to isolate all the W-letter words in the query sequence, along with the offset into the sequence where each word occurs. These offsets are what is tabulated. Later, when a database sequence appears, the task is to walk through each offset of the database sequence and to recover from the lookup table the offsets of query words that score well when aligned at that database offset. The collection of [query_offset, db_seq_offset] pairs is the output of the 'DB scanning' phase (described later).
If the alphabet has A letters, then the lookup table has size at least A^W. Basically, each possible W-letter word hashes (via some 1-to-1 function) to a unique location in the lookup table. The current hash function is to consider each letter as a number from 0 to A-1, which fits in B = ceil(log2(A)) bits, and to concatenate the integer values of W letters together to form a B*W-bit value. This value is the index into the lookup table that corresponds to the given W-letter word. Typical parameters for BLAST searches are listed below:
program | alphabet size | W | B | number of hashtable cells |
---|---|---|---|---|
blastp, blastx, tblastn, tblastx | 28 | 3 | 5 | 32768 |
blastn | 4 | 8 | 2 | 65536 (see below) |
megablast | 4 | 11 or 12 | 2 | 2^22 or 2^24 (see below) |
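As a concrete illustration of this indexing scheme, here is a minimal sketch (not the engine's actual code; the function name is hypothetical) of turning a W-letter word into a lookup table index:

```c
/* Concatenate the B-bit encodings of W letters into one index.
   A sketch; letters are assumed pre-encoded as integers in [0, A-1]. */
static unsigned int hash_word(const unsigned char *word, int W, int B)
{
    unsigned int index = 0;
    int i;
    for (i = 0; i < W; i++)
        index = (index << B) | word[i];   /* append the next B bits */
    return index;
}
```

For blastp (W = 3, B = 5) this produces the 15-bit index used with the 32768-cell table above; for blastn (W = 8, B = 2) it produces a 16-bit index.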
The fact that these lookup tables are quite large is the basis for many of the performance optimizations present in the code. Each entry of the lookup table must point to a list of numbers, whose size and contents are determined when the query sequence is preprocessed.
The procedure for filling the lookup table is as follows:
```
for (i = 0; i <= query_length - W; i++) {
    query_word = [ query[i], query[i+1], ..., query[i+W-1] ]
    /* neighboring loop */
    for all W-letter combinations of A characters {
        db_word = current list of letters
        if (db_word exactly matches query_word ||
            score of aligning query_word with db_word >= T) {
            add i to list of query offsets in lookup_table[HASH(db_word)]
        }
    }
}
```
The quantity T is the 'neighboring threshold', a measure of how good a word alignment must be in order for its associated query offset to make it into the lookup table. Basically, for the ith query word, the loop above adds offset i to the lookup table entry of every possible database word that is sufficiently similar to word i of the query sequence. For nucleotide searches, T is always infinite. Note that 'word i' above means the query word starting at letter i of the query. All BLAST lookup tables index words by their starting offset; this is a reversal of the old engine, which indexed words by their ending offset.
Later, given a subject sequence, for subject offset j we form the W-letter word at that offset and copy out all of the query offsets in lookup_table[HASH(word j)]. The list of offset pairs (list[0],j), (list[1],j), etc. is then exactly the list of seed alignments between the query and database sequence. Especially when the query is short (i.e. has few words to index) and the database is large (i.e. the time needed to create the lookup table is amortized over searching many database sequences), the scanning process is tremendously faster than a brute-force attempt to align all of every sequence. It finds all seeds with score >= T in time that essentially depends linearly on the database size (on average; the worst-case running time is still proportional to the product of query and database size).
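A sketch of that scanning loop, under the assumptions above (the real engine uses strided, compression-aware scanning routines; the cell layout, report_seed, and the hash_word helper from the earlier sketch are all illustrative):

```c
/* Illustrative cell type for this sketch; the engine's layout differs. */
struct cell { int num_used; int *offsets; };
extern struct cell *lookup;                 /* one cell per table entry */
extern void report_seed(int q_off, int s_off);

/* Emit all [query_offset, db_seq_offset] seed pairs for one subject. */
static void scan_subject(const unsigned char *subject, int subject_length,
                         int W, int B)
{
    int j, k;
    for (j = 0; j + W <= subject_length; j++) {
        unsigned int index = hash_word(subject + j, W, B);
        for (k = 0; k < lookup[index].num_used; k++)
            report_seed(lookup[index].offsets[k], j);
    }
}
```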
The HASH() function is very simple, since the lookup table is large enough not to need a complicated function to reduce the chance of collisions. Basically it involves concatenating the bits that make up the word to be hashed, then using that value to index directly into the table. For protein searches, each letter of the 28-letter alphabet is embedded in 5 bits and the set of 5-bit quantities is concatenated to form the lookup table index. For nucleotide searches the words are typically 8-12 letters wide, and HASH() is thus a concatenation of 8-12 values, each two bits wide. Note that in the protein case this means that large expanses of the lookup table will never be used, because a protein letter will never take the values 28-31 that its 5-bit representation allows. The code currently does not take advantage of that fact to compress the lookup table, and maybe it should. In any case, one small optimization for protein lookup tables is that when allocating the memory for the table, the new engine allocates 32 * 32 * ... * 32 * 28 entries. HASH() is unchanged, but the last letter doesn't have to account for the 'hole' in the table after it, so that the lookup table has 28K cells instead of the full 32K (for W = 3).
One final complication for nucleotide lookup tables: since HASH() expects letters in a word to be exactly two bits wide, ambiguities cannot be handled by the lookup table. When the table is built, query words that contain an ambiguity do not contribute any lookup table entries, and a database sequence has all of its ambiguities converted into one of the four bases before the BLAST code ever sees it. The ambiguities are added back in at a much later stage in the alignment process.
Lookup tables are constructed by LookupTableWrapInit (lookup_wrap.c). Lookup table construction proceeds in two phases for both protein and nucleotide searches. The first phase constructs a generic lookup table by repeated calls to BlastLookupAddWordHit (BLAST_lookup.c). The generic lookup table is simply an array of pointers (the 'backbone') to lists of 32-bit integers, with one list per hashtable cell. Each list begins with one word giving the number of entries allocated for that list and one word giving the number of query offsets currently stored, followed by the query offsets themselves. The list for each hashtable cell grows dynamically, and when the query is large this layout saves half the memory that the old BLAST engine would need.
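Pictured as C, the generic table looks roughly like this (a sketch; the field names are illustrative, not the engine's declarations):

```c
#include <stdint.h>

/* Generic lookup table sketch. For a non-empty cell c:
     c[0]   = number of entries allocated for the list
     c[1]   = number of query offsets currently stored
     c[2..] = the query offsets themselves */
typedef int32_t *GenericCell;

struct GenericLookupTable {
    GenericCell *backbone;   /* one list pointer per hashtable cell;
                                NULL means the cell is empty */
    int32_t num_cells;       /* A^W, or the trimmed protein size */
};
```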
Once the generic lookup table is filled, it is then packed into an optimized structure. The type of optimized structure depends on several factors, and each type is described below. Note that a megablast-style optimized structure is constructed directly, without forming a generic lookup table first.
When the query is large or the database is small (or even when not performing a database search but just aligning two sequences), the time needed to construct the generic lookup table starts to become more important. Fortunately, there are many optimizations possible that can reduce the required time by up to two orders of magnitude.
The nucleotide case is easiest: here we only care about exact matches, so there is no need to construct all of the possible neighboring words. Every query word updates only one lookup table location, so the inner for() loop in the code above is not needed at all. Generic nucleotide lookup tables are constructed by BlastLookupIndexQueryExactMatches (BLAST_lookup.c), and the construction process is fast enough to never be a bottleneck.
Protein BLAST does require neighboring word comparisons. Here there are two problems to solve: generating all W-letter words efficiently, and performing the alignment comparison efficiently. Both of these jobs are performed by BlastAaLookupIndexQuery (BLAST_aalookup.c), which has separate code paths for ordinary score matrices and position-specific score matrices; we will only describe the former in detail. The most important case is W=3, since this is by far the most common; W=2 is already fast enough, but ideally higher values of W should be possible as well.
With a 28-letter alphabet and W = 3, there are 28*28*28 = 21952 three-letter words possible. For large queries (especially translated queries, where the underlying nucleotide sequence can be large) it is reasonable to expect the same 3-letter word to appear again and again in the query. It is a waste of effort to recalculate the inner for() loop above for a query word that has appeared before; the current query offset will go into all of the same lookup table locations calculated the last time the same word appeared.
To take advantage of this fact, s_AddNeighboringWords actually performs the generic lookup table generation process twice. The first pass builds a lookup table containing only exact matches; this has the effect of grouping together query offsets that refer to the same W-letter word. Next, the code walks through this temporary table. If a table entry contains one or more query offsets, the full neighboring computation is performed on the first query word in the list, and whenever that first offset is placed into an entry of the generic lookup table, the entire chain of offsets is added to that entry as well. This means that no matter how large the query sequence actually is, the neighboring loop runs at most 28^W times (and usually far fewer). This optimization is completely orthogonal to the problem of making the neighboring loop itself run fast.
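Schematically, the two passes look like this (pseudocode in the same style as the blocks above):

```
/* pass 1: index exact matches only, grouping equal words together */
build exact-match table for the query

/* pass 2: one neighboring computation per distinct query word */
for each cell c of the exact-match table {
    if (c is non-empty) {
        w = the query word at the first offset stored in c
        for every word d with score(w, d) >= T {
            add ALL offsets stored in c to lookup_table[HASH(d)]
        }
    }
}
```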
The original BLAST code created the entire list of W-letter words, and then scanned it for words that score highly when aligned with the query word. When W=3 the memory used is modest, but for larger W this is impractical.
If W were fixed, it would be tempting to build up each word incrementally:
```
for (i = 0; i < 28; i++) {
    change position 0 to letter i
    for (j = 0; j < 28; j++) {
        change position 1 to letter j
        for (k = 0; k < 28; k++) {
            change position 2 to letter k
        }
    }
}
```
This uses no memory at all, but the number of nested loops must be known beforehand. A compromise is to work recursively, as the function s_AddWordHitsCore (BLAST_aalookup.c) does:
```
MakeWord(int pos, char *buf) {
    if (pos == W-1) {
        for (i = 0; i < 28; i++) {
            set buf[pos] to letter i
            calculate the alignment score, and add the query
            offset to cell HASH(buf) of the generic lookup
            table if the score reaches the threshold T
        }
        return;
    }
    for (i = 0; i < 28; i++) {
        set buf[pos] to letter i
        MakeWord(pos+1, buf)
    }
}
```
The overhead here is somewhat higher, but at least arbitrary W can be handled without running out of memory. The big advantage of this method, however, is that it makes pruning whole groups of database words very easy. Remember that we're looking for database words that score above a threshold T when aligned with the query word. Suppose a database word has W-1 of its letters specified, and has a score S for aligning those letters with the first W-1 letters of the query word. If no last letter can achieve a score of at least T - S when aligned with the last letter of the query word, then there is no need to check any of the 28 database words that share those first W-1 letters; we know beforehand that they cannot reach the threshold T. The same check can be extended to positions before the last letter, and pruning at an earlier position skips even more database words.
The old BLAST engine implemented pruning on the last two letters of the word, but with the recursive formulation above we can implement pruning at all letter positions, with only minor changes. The score begins with the best score possible for the given query word, and as more letter positions in database words are fixed the score gradually converges to the correct value for a given database word.
```
find_neighbors(query_word) {
    max_score[i] = the maximum score of row i of the score matrix
    curr_score = 0
    for (i = 0; i < W; i++)
        curr_score += max_score[query_word[i]]
    MakeWord(0, buf, curr_score)
}

MakeWord(int pos, char *buf, int curr_score) {
    /* remove the best-case score for this position */
    curr_score -= max_score[query_word[pos]]
    if (pos == W-1) {
        for (i = 0; i < 28; i++) {
            set buf[pos] to letter i
            calculate the alignment score, and add the query
            offset to cell HASH(buf) of the generic lookup
            table if the score reaches the threshold T
        }
        return;
    }
    for (i = 0; i < 28; i++) {
        /* replace the best-case score with the exact score of
           aligning query_word[pos] to letter i, and only recurse
           if there is still a chance of reaching the threshold */
        new_score = curr_score + matrix[query_word[pos]][i];
        if (new_score >= T) {
            set buf[pos] to letter i
            MakeWord(pos+1, buf, new_score)
        }
    }
}
```
For sensibly chosen values of T, the pruning reduces the number of database words that must be generated by an order of magnitude. Further, because pruning takes place for all positions in the word, lookup table construction is much faster for W > 3.
There is a subclass of protein searches that needs special consideration when the lookup table is being constructed. Query sequences that are accompanied by a Position-Specific Score Matrix (PSSM) have a separate score matrix row for each offset in the query sequence. Unlike the square-matrix case explained above, the offsets of identical query words cannot be lumped together (since their score matrix rows need not be identical). Further, while the same recursive procedure can be used to generate neighboring words to compare against query words, the maximum score of each PSSM row must be determined separately. The top-level function that performs this is s_AddPSSMNeighboringWords, and the recursive function that fills the generic lookup table is s_AddPSSMWordHits.
Originally BLAST used a structure equivalent to the generic lookup table to conduct the actual database search. However, in the late 1990s it became clear that using the generic table to retrieve query offsets caused problems with memory latency. By the time the table was constructed, query offsets were sitting in arrays of integers that were scattered all over memory, and the hashtable itself exceeded the cache size of current microprocessors. Hence the act of retrieving query offsets from the lookup table would involve one read of the hashtable (causing a cache miss), and if the resulting pointer indicated that query offsets were present at that position then they would be retrieved from the relevant list (causing another cache miss). Actually, early versions of BLAST stored all of the query offsets in a single array, with query offsets that belonged to the same hashtable cell being linked together through pointers. Not only did this double the memory consumption, but the act of traversing the array involved a cache miss for every query offset retrieved, since offsets were generally not contiguous in memory.
Because accessing the lookup table for query offsets is potentially critical for performance, a second round of processing is performed on the generic lookup table after all of the query offsets are loaded into it. This phase, originally contributed by Compaq in the late 1990s and currently implemented by the function s_BlastAaLookupFinalize (BLAST_aalookup.c), converts the generic lookup table into a packed form and adds auxiliary data structures that help reduce cache misses later.
In the common case the query sequence is not very large, which in turn means that the generic lookup table is mostly empty. BlastAaLookupTable::pv_array (pv = 'presence vector') is used to exploit this; it is a bitfield with one bit for each cell in the hashtable. If an entry in the hashtable contains query offsets, the corresponding bit in the PV array is set, and later code first checks the PV array to see whether there are any query offsets in a particular lookup table entry. Because the bitfield takes up much less memory than the backbone, it can potentially stay in cache even when the hashtable cannot; when the query is short, this means that many cache misses are avoided.
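The test itself is tiny; something along these lines (a sketch assuming one presence bit per hashtable cell, packed 32 to a word):

```c
#include <stdint.h>

/* Sketch: return nonzero if hashtable cell 'index' is non-empty,
   so that the backbone only needs to be touched on a hit. */
static int pv_test(const uint32_t *pv_array, unsigned int index)
{
    return pv_array[index / 32] & (1u << (index % 32));
}
```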
A second optimization replaces the hashtable with a 'thick backbone', an array of type AaLookupBackboneCell. This structure contains one word for the number of query offsets in that lookup table entry, plus AA_HITS_PER_CELL (= 3) additional words. Since the common case has most backbone cells empty, it follows that the cells which are not empty will not be very full either, on average. If the number of query offsets at a given position in the lookup table is AA_HITS_PER_CELL or less, then all of those query offsets are stored directly in the AaLookupBackboneCell. Backbone cells that contain more than this number of query offsets have those offsets packed contiguously into the BlastAaLookupTable::overflow array; in such cells, one of the words in the AaLookupBackboneCell is the offset into the overflow array where the query offsets for that lookup table position begin, and the other words are left unused.
Accessing the lookup table therefore involves checking the PV array (which is usually in cache), then checking the thick backbone (causing a cache miss). If the number of offsets to retrieve is small, they are available for free (since the cell is now in cache). Otherwise the offsets are accessed sequentially from the overflow array, usually causing a cache miss but possibly not (the overflow array may fit in L2 cache, and in any case retrieving one group of offsets can pull its neighbor(s) into cache as well).
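Putting the two cases together, the access path looks roughly like this (a sketch: the struct mirrors the description of AaLookupBackboneCell above, but the field names are illustrative):

```c
#include <stdint.h>

#define AA_HITS_PER_CELL 3

typedef struct {
    int32_t num_used;                  /* query offsets in this cell */
    int32_t payload[AA_HITS_PER_CELL]; /* the offsets themselves, or
                                          payload[0] = start index into
                                          the overflow array */
} BackboneCell;

/* Return a pointer to the query offsets for one cell. */
static const int32_t *cell_offsets(const BackboneCell *cell,
                                   const int32_t *overflow)
{
    if (cell->num_used <= AA_HITS_PER_CELL)
        return cell->payload;            /* stored in the cell itself */
    return overflow + cell->payload[0];  /* spilled to the overflow array */
}
```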
Increasing the amount of reuse is even more important today than it used to be, since cache has gotten much faster than main memory (i.e. a factor of 20 faster instead of the factor of 3 that was typical when these optimizations were first implemented). My belief is that the sort of tuning that made sense for an Alpha with 4MB of cache (the original target) is no longer optimal for e.g. a Pentium with 512kB of extremely fast cache and much slower RAM.
Nucleotide BLAST can also benefit from a compressed lookup table, but the characteristics of a nucleotide search differ from those of a protein search. Nucleotide queries can be very small or very large, the word size W can be small or large, and in the nucleotide case the generic lookup table only indexes exact matches. Thus the lookup table size and number of entries vary widely (although the average lookup table is very sparsely populated, much less so than the blastp lookup table). The performance regime in which nucleotide BLAST operates is therefore quite wide, and optimal performance requires data structures that can adapt to the specifics of the search.
Experience shows that the vast majority of nucleotide searches have tiny queries. In that case the lookup table is very sparsely populated and hits to database sequences are pretty rare. Most of the time the generic lookup table indexes 8-mers, i.e. groups of 8 contiguous bases. While it is possible to use a nucleotide version of the BlastAaLookupTable structure, in practice this is overkill. The vast majority of bits in the PV array are empty, so the latency to access those bits is the limiting factor when scanning a database sequence. Unfortunately, profiling shows that on x86 hardware the PV array very often cannot stay in the fastest processor caches during the search, possibly because other parts of the search kick it out. Even when the whole array is in cache, checking the PV array takes a lot of time, possibly because the sequence of machine instructions needed to construct the array index, load the word containing the appropriate bit, and test that bit introduces data dependencies that prevent a modern out-of-order CPU from processing many scanning iterations in parallel. Combined with the fact that the PV array is in secondary cache much of the time, nucleotide BLAST in these circumstances would be faster on average if the backbone were accessed directly, assuming the backbone itself is small enough to reside in secondary cache.
Cameron and Williams were the first to realize that retuning the optimized lookup table would yield increased search performance, and their techniques have been ported to NCBI BLAST. If the nucleotide search is such that:
- the maximum query offset to index is less than 32768
- the number of query words to index is less than 32768
then BLAST will create a lookup table of type BlastSmallNaLookupTable (BLAST_nalookup.h) by calling BlastSmallNaLookupTableNew (BLAST_nalookup.c). This function creates the generic lookup table and then calls s_BlastSmallNaLookupFinalize to create the optimized table. The latter function can fail, since it is theoretically possible for a query sequence to be sufficiently compositionally biased that it generates too many words to index, especially if both query strands are considered together. If that happens, or if the query sequence is clearly not suitable for a BlastSmallNaLookupTable, then BLAST falls back to creating a nucleotide version of the protein lookup table (BlastNaLookupTable) by calling BlastNaLookupTableNew (BLAST_nalookup.c). If the SmallNa style of lookup table is possible, it should always be used; while there may be no performance benefit (depending on the underlying machine), I have not seen a case where, given the choice, it would be more efficient to use the larger lookup table.
The SmallNa table contains a backbone, but each backbone cell holds just one 16-bit integer, and the overflow array is also a list of 16-bit integers. If the value at a particular cell is nonnegative, then that position in the hashtable contains exactly one query offset, equal to the cell value. If the value is -1, the cell is empty. Finally, if the cell value is -x (x > 1), then the query offsets for the cell begin at offset x of the overflow array and continue until a negative query offset is encountered. The use of 16-bit integers gives rise to the restrictions on the query described above, and the structure of the overflow array means that one cannot tell whether the SmallNa table is allowed until the overflow array is actually built (by calling s_BlastSmallNaLookupFinalize), since it contains extra entries that are not query words.
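Decoding one backbone cell then follows the three cases above. A sketch (the names are illustrative, and the treatment of the terminating negative entry is simplified here):

```c
#include <stdint.h>

/* Sketch: collect the query offsets for one cell of a SmallNa-style
   table into 'out'; returns the number of offsets found. */
static int small_na_cell(const int16_t *backbone, const int16_t *overflow,
                         int index, int16_t *out)
{
    int16_t value = backbone[index];
    int n = 0;
    if (value == -1)
        return 0;                 /* empty cell */
    if (value >= 0) {
        out[n++] = value;         /* exactly one query offset */
        return n;
    }
    /* offsets begin at overflow[-value] and continue until a
       negative entry is encountered */
    for (value = -value; overflow[value] >= 0; value++)
        out[n++] = overflow[value];
    return n;
}
```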
When the SmallNa table indexes 8-mers, the working set for the backbone is 128 kbytes and the overflow array is typically quite small. There is no PV array, so accessing the lookup table involves computing the table index and accessing the hashtable with a single load instruction. This short instruction path means that out-of-order processors can overlap many lookup table accesses simultaneously, and if a PV array would have been in external cache anyway, then not using one actually reduces cache accesses. The end result is a 15-20% overall speedup on Intel hardware for small queries.
Megablast was designed for nucleotide queries that are very large, and for finding alignments that are very high-scoring; this implies a large value of W.
The lookup table used by megablast indexes 9-12 letter exact matches with query words, but many of the principles explained above still apply. The main megablast lookup table structure is BlastMBLookupTable, defined in BLAST_nalookup.h.
Unless the query is extremely large (millions of bases), the size of the lookup table guarantees that the vast majority of table cells are empty, and most non-empty lookup table entries will have very few query offsets (usually just one). There are two main data structures used to hold query offsets. BlastMBLookupTable::next_pos is an array of integers the size of the query sequence; position i in this array corresponds to query offset i. BlastMBLookupTable::hashtable is a thin backbone with one word for each of the lookup table entries. If a lookup table entry has no query offsets, the corresponding entry in hashtable[] is zero; otherwise it is an index into next_pos[].
The procedure for building the megablast lookup table is essentially as follows (the actual functions, s_FillContigMBTable and s_FillDiscMBTable in BLAST_nalookup.c, are much more complicated):
```
for (i = 0; i <= query_length - W; i++) {
    query_word = [ query[i], query[i+1], ..., query[i+W-1] ]
    index = HASH(query_word)
    /* store i+1 so that zero can mean end-of-chain (see below) */
    next_pos[i+1] = hashtable[index]
    hashtable[index] = i+1
}
```
When complete, the megablast lookup table is a giant collection of linked lists. Most of these lists are empty; the head of list i is hashtable[i]. This value is the first query offset, and in turn gives the position in next_pos[] of the next query offset. Making next_pos the size of the query sequence means that query offsets don't have to be listed explicitly; rather, a position in next_pos[] is synonymous with a query offset, and the value stored at that position serves as a 'next pointer' to the succeeding query offset in the chain. A value of zero universally means end-of-chain, so all offsets must actually lie in the range [1, query_length] and not [0, query_length-1].
Accessing the lookup table involves calculating HASH(database_word), reading the hashtable[] word at this index and, if it is nonzero, retrieving the linked list of query offsets and decrementing each such offset. Since non-empty lists are rare and all lists are short, there is no need for extra packing of next_pos[].
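In code, that traversal looks roughly like this (a sketch; hashtable and next_pos are as described above, and report_seed is the illustrative helper from the earlier scanning sketch):

```c
#include <stdint.h>

extern int32_t *hashtable;   /* thin backbone; zero = empty list */
extern int32_t *next_pos;    /* chain links, indexed by offset+1 */
extern void report_seed(int q_off, int s_off);

/* Walk the chain of query offsets for the database word at subject
   offset j, whose hash value is 'index'. */
static void walk_chain(unsigned int index, int j)
{
    int32_t q = hashtable[index];
    while (q != 0) {
        report_seed(q - 1, j);   /* undo the +1 applied at build time */
        q = next_pos[q];
    }
}
```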
For much more on the process of optimizing use of the lookup table in megablast, see the Appendix "Optimization of Megablast".
Megablast was originally developed as a separate collection of source code in the old engine: it has its own lookup table, its own extension routines, and its own gapped alignment infrastructure, all distinct from the machinery of blastn searches. Recent progress with the new engine has allowed blastn and megablast to look much more like each other, to the point where the BLAST engine can automatically choose the most efficient method to align nucleotide sequences. In fact, other than using greedy gapped extension, megablast is just blastn with different data structures.
The improvements start with two fundamental questions: why does the blastn lookup table have to be 4 or 8 letters wide, and why does the megablast lookup table have to be 12 letters wide? In the beginning the tables had to be a multiple of four letters wide because the scanning of database sequences happens 4 letters at a time, and making everything a multiple of four meant that a compressed database sequence could be scanned a full byte at a time. However, BLAST currently uses improved scanning methods that need strides different from 4 anyway. The other big reason these sizes were used was that the code was written that way and changes were too difficult. The new BLAST engine is modular enough that these fundamental statements about the way BLAST works can all change, and change adaptively to suit each particular search.
The runtime needed for the database scanning and initial word extensions depends on several factors. If a lookup table word has more letters, the table is accessed less often and so the number of initial extensions is reduced; but more letters mean a larger table whose cache performance is worse. A smaller table has the opposite effect: the latency of table access is reduced but the number of hits and the number of extensions performed on those hits both increase. Thus, for each search there is an optimal table size that represents a good balance between these conflicting goals. A small query will not fill up even a narrow lookup table, and so should benefit from the improved cache performance of a table with fewer letters per word. Likewise, a large query will saturate an ordinary-size lookup table, and it's more efficient to use a wide table that is accessed much less often. To add more variables, the initial word extension phase of the new engine does not care what sort of lookup table produced hits that need extending; all it wants to know is where the hits are located. This means we can use a megablast lookup table for a blastn search or a blastn lookup table for a megablast search.
Currently, blastn lookup tables can contain 4-8 letters per word (inclusive) and megablast lookup tables can contain 9-12 letters per word (inclusive). The type of lookup table and its width depend on the number of words in the query that need indexing (which can be accurately estimated, and is typically close to 2x the number of bases in the query) and the word size.
The following table gives the choices the new engine will make; all of these were experimentally determined, and the optimal choice will probably vary between different microprocessors. The logic to choose the lookup table type and width is located in the function BlastChooseNaLookupTable (BLAST_nalookup.c). Note that 'blastn' below means the BlastSmallNaLookupTable type by default, or BlastNaLookupTable if the SmallNa table cannot be used on the query.
number of words to index | search word size | chosen table type | chosen table width |
---|---|---|---|
any | 4 | blastn | 4 |
any | 5 | blastn | 5 |
any | 6 | blastn | 6 |
0 - 250 | 7 | blastn | 6 |
250+ | 7 | blastn | 7 |
0 - 8500 | 8 | blastn | 7 |
8500+ | 8 | blastn | 8 |
0 - 1250 | 9 | blastn | 7 |
1250 - 21000 | 9 | blastn | 8 |
21000+ | 9 | megablast | 9 |
0 - 1250 | 10 | blastn | 7 |
1250 - 8500 | 10 | blastn | 8 |
8500 - 18000 | 10 | megablast | 9 |
18000+ | 10 | megablast | 10 |
0 - 12000 | 11 | blastn | 8 |
12000 - 180000 | 11 | megablast | 10 |
180000+ | 11 | megablast | 11 |
0 - 8500 | 12 | blastn | 8 |
8500 - 18000 | 12 | megablast | 9 |
18000 - 60000 | 12 | megablast | 10 |
60000 - 900000 | 12 | megablast | 11 |
900000+ | 12 | megablast | 12 |
0 - 8500 | 13+ | blastn | 8 |
8500 - 300000 | 13+ | megablast | 11 |
300000+ | 13+ | megablast | 12 |