Skip to content

Representation of Sequences in BLAST

christiam edited this page Aug 12, 2014 · 3 revisions

The blast engine core represents a biological sequence in one of only a few different ways, and sometimes adds additional information used in the course of a search. It is the job of the 'api' directories of the C and C++ toolkits to convert sequences from whatever format they happen to be in to the standardized format that the engine core expects. The tables that describe these formats can be found at http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/BIOSEQ.HTML

At the lowest level, all biological sequences are represented by a BLAST_SequenceBlk structure (defined in blast_def.h). The contents of this structure, however, depend on the type of sequence represented. This data structure is designed to represent multiple concatenated sequences in a transparent way.

Amino Acid Sequences

The core blast engine expects all protein sequences to be in ncbistdaa format; sequence letters assume a value of 0 to 27 inclusive, representing the 22 amino acids and a collection of miscellaneous characters (gaps, end-of-sequence, various ambiguity characters). Note that portions of the blast engine will fail if any sequence letters are actually zero (the gap character), so these must be filtered out before calling the engine.

Protein sequences are represented as an array of unsigned chars, with one amino acid residue per array element. If a SequenceBlk is supposed to represent multiple sequences, all of the sequences are concatenated one after the other with a single zero byte (a 'sentinel') in between each pair of sequences. The number of residues in the sequence is given by BLAST_SequenceBlk::length; in the case of multiple sequences, the length field refers to the combined length of all sequences, plus the sentinel bytes in between them. Note that if a SequenceBlk contains multiple sequences, all the information needed to determine where one sequence ends and another begins, as well as other per-sequence information, is placed in a BlastQueryInfo structure.

The SequenceBlk::sequence field points to the array of sequence data. Elements 0 to length-1 of this array are assumed to point to valid sequence letters. The SequenceBlk::sequence_start field is used to point to sequence data that is imported from other structures. This field is necessary because sometimes, for example when reading blast databases, the first and last sequence letters are actually sentinel bytes. Rather than knowing when and when not to ignore the first letter, the sequence_start field points to the array as it really is and the 'sequence' field always points to valid sequence data.

Nucleotide Sequences

Nucleotide sequences may be stored in one of three ways in a BLAST_SequenceBlk.

ncbi4na format stores the sequence as an array of unsigned chars, with one nucleotide per array element. ncbi4na represents nucleotides as a bitfield four bits wide, with each bit representing one of the four nucleic acids. This bitfield is necessary to properly represent ambiguity characters, where it is unknown which one of several nucleic acids may actually be present at that position in the sequence. As with protein sequences, 0 is treated as a sentinel byte and should not occur within the sequence.

ncbi2na is a compressed representation of a nucleotide sequence. Each nucleic acid is represented as a 2-bit value, and four nucleotides are packed into each byte of an unsigned char array. Within a byte, bits 6-7 store nucleotide 0, bits 4-5 store nucleotide 1, etc. The NCBI2NA_UNPACK_BASE macro (blast_util.h) is used to extract individual nucleotides from their packed representation.

Ambiguities cannot be represented in ncbi2na format. The preliminary stage of blast will only process the ncbi2na version of each subject sequence, while the traceback stage will add the ambiguities back in and recompute any alignments found. To save disk space, nucleotide blast databases store sequences in ncbi2na format along with auxiliary information needed to restore ambiguities. When the database is created, each ambiguous character in the ncbi2na version of the sequence is randomly assigned one of the nucleic acids that the ambiguity allows. This random assignment can change depending on the position of the sequence within the database, and this is expected. Basically one cannot rely on a particular combination of letters appearing when there is no guidance on what those letters should be.

The length field of the BLAST_SequenceBlk gives the number of nucleotides in the sequence, not the number of bytes used to store it. Finally, note that no sentinel byte value is unique for ncbi2na (since 4 nucleotides in a byte use up all bit combinations possible), so that it is impossible to have multiple concatentated sequences in ncbi2na format. Future changes could allow this, but implementations cannot rely on sentinel bytes to distinguish between sequences.

Finally, blastna format is a permutation of ncbi4na. The difference is that values 0-3 stand for nucleotides, and values > 3 stand for ambiguities. This allows easy comparisons with letters in ncbi2na format; it also allows a 4x4 nucleotide score matrix to become the top left corner of the 16x16 full score matrix for nucleotides. If a SequenceBlk represents multiple concatenated nucleotide sequences, then 0x0f is treated as a sentinel byte (since 0 is a valid sequence letter).

The following table describes the various internal encodings used for nucleotide blast searches:

BLAST Program Query Sequences Database Sequences
blastn read in as blastna ncbi2na for preliminary search, blastna for traceback search
blastx read in as ncbi4na, converted to ncbistdaa ncbistdaa (protein database)
tblastn ncbistdaa (protein query) read in as ncbi4na, converted to ncbistdaa
tblastx read in as ncbi4na, converted to ncbistdaa read in as ncbi4na, converted to ncbistdaa

The choice of encoding for database sequences is given by the function Blast_TracebackGetEncoding (blast_traceback.c), while the C API uses the first portion of s_SeqLocFillSequenceBuffer (blast_seq.c) and the C++ API uses GetQueryEncoding (blast_setup_cxx.cpp).

If one has an array of bytes and the number of elements in the biological sequence, whether amino acids or nucleotides, then calling BlastSeqBlkNew and then BlastSeqBlkSetSequence will convert the array to a BLAST_SequenceBlk.

When performing a blastn search, by default both the plus and minus strand of the nucleotide query participates in the search. The setup code outside the core engine assumes that nucleotide sequences are plus-strand only when passed in, and explicitly constructs the minus strand as a second query sequence.

Translated Sequences

Additional fields are needed within the SequenceBlk to represent a nucleotide sequence that has been translated into its six reading frames. There are two representations of translated sequences, in-frame and out-of-frame.

In-frame translation assumes that the entire nucleotide sequence exists in the same frame. The nucleotide sequence is translated one frame at a time, and the six complete translations are concatenated together and placed in the SequenceBlk. From this point onward, the SequenceBlk behaves as a collection of six concatenated protein sequences (frames 1, 2, 3, -1, -2, -3, in that order). For ncbi4na format the translation is performed by the function BLAST_GetTranslation, while ncbi2na format uses the function BLAST_TranslateCompressedSequence. Both assume that the genetic code that governs the conversion of nucleotide triplets to protein letters has already been set up. This routine translates a single sequence; the engine assumes that if multiple nucleotide sequences are concatenated then their translations are concatenated in the same order. Hence N nucleotide sequences should be turned into 6N protein sequences before being passed to the core engine as a query for a translated search. The C toolkit function BLAST_SetUpQuery and the C++ toolkit function SetupQueries perform this work.

An out-of-frame (OOF) sequence may slip from one frame to another on a protein-letter by protein-letter basis. The engine uses the in-frame representation for computing ungapped alignments, but when performing out-of-frame gapped alignments it is necessary to keep use separate 'mixed frame' representation of the translated sequence. This contains all the same information as the in-frame sequence, but letters from different frames are interleaved (see function BLAST_InitDNAPSequence). Frames 1, 2, and 3 are mixed together, as are the protein letters of frame -1, -2 and -3. Sentinel bytes are all kept, and stacked three deep at the beginning and end of the mixed-frame sequence so that the in-frame and OOF sequences have the same length. After creation, the mixed-frame sequence is stored in BLAST_SequenceBlk::oof_sequence, and the oof_allocated boolean tells whether this data should be explicitly freed when the SequenceBlk is deallocated (this is currently always necessary). Note that OOF alignments require that the SequenceBlk contains both the in-frame and the OOF version of a sequence; out-of-frame format is a convenience that is only used deep in the blast engine when gapped alignments must be performed; for all other purposes the in-frame version of the sequence is used.

OOF tblastn reads each nucleotide database sequence, converts to the in-frame translated sequence, and then also produces the OOF version of that sequence. OOF blastx is more complicated, because filtering can change the sequence letters. The nucleotide query is read in as ncbi4na, and converted to its in-frame translated form immediately. The production of the OOF version of the sequence must wait until BLAST_MainSetup, so that filtering can first be applied to the set of 6 protein sequences. The filtered versions of the sequences are then interleaved into OOF format. This asymmetry saves having to filter twice (and the filtering code does not know about OOF format anyway).

The conversion of nucleotide letters to protein letters happens under the control of a genetic code. However this code is represented, when translation actually happens the genetic code is a 64-entry lookup table that converts a triplet of nucleotides into a single protein letter. One genetic code is used for virtually all nucleotide sequences, but sometimes a different genetic code is required (for example, in the case of exotic microbial sequences).

One subtle issue is that all query sequences are assumed to use the same genetic code, and all database sequences use the same genetic code (possibly different from the one used in the query). The latter is very important, because it means that arbitrary nucleotide sequences cannot just be thrown together into a database if translated searches are a possibility. Some of those sequences may want a genetic code different from that of other sequences, and while the engine is flexible enough to change the genetic code on a per-subject-sequence basis, there is currently no room in the blast database format for a variable genetic code.

An even more subtle issue relates to how ambiguities are handled. Translating a sequence in ncbi2na format is easy, and can be highly unrolled and optimized (see the routine BLAST_TranslateCompressedSequence). When ambiguous nucleotide letters are a possibility, the routine that performs translation is s_CodonToAA. Genetic codes do not give any guidelines for converting ambiguous nucleotides, so this routine adopts a reasonable compromise: whenever an ambiguity is encountered in one of the three nucleotide letters, the protein letter corresponding to every possible choice of that ambiguity is computed. If all combinations produce the same protein letter, that letter is the result; otherwise the result is an X character. This process can be made much more efficient, but translation from ncbi4na format is not currently on any critical path.

Maintaining per-Sequence Information

The preliminary stage of the blast engine currently aligns multiple query sequences simultaneously against a single database sequence (if multithreaded, the query sequences are the same in all threads). Sometimes the core must distinguish between individual query sequences, while other times the complete collection is treated like a single query. Note that different frames of a translated query, or different strands of a nucleotide query, are treated by the engine as if they were distinct sequences. This generalization means that the engine may be aligning multiple queries simultaneously even if only a single query was provided.

Information that is unique to individual query sequences is stored in the BlastQueryInfo structure, described in blast_def.h. The C toolkit uses s_QueryInfoSetUp and the C++ toolkit uses SetupQueryInfo_OMF to initialize BlastQueryInfo.

Every individual query sequence, whether a true distinct sequence or one frame of a translated sequence, is assigned a unique number. This 'context' unambiguously identifies that sequence among all the others concatenated together. The rest of the blast engine can mostly deal with contexts only. Information specific to a single context is stored in a BlastContextInfo structure, and BlastQueryInfo::contexts contains the list of all contexts for the query to a blast search. The data in this structure must match up to the actual sequence data in a BLAST_SequenceBlk.

Each context contains several pieces of information:

query_offset The offset of the first valid sequence letter within BLAST_SequenceBlk::sequence that belongs to this context. If the context is not valid, this number is set to (query_offset of previous context) + (length of previous context) + 1. Note that if element 0 of the context array is invalid, its query offset doesn't matter, since loops that scan through all contexts willskip over element 0
query_length The number of residues in this context. If the context is not valid, this must be zero
eff_searchsp The effective search space, which is used for calculating the e-values of blast hits involving this context. If the context is not valid, this must be zero
length_adjustment Used to compute the effective search space
query_index Used for determining the actual query sequence to which a context belongs. For example, this is useful for mapping frames of a translated sequence back to the query sequence that produced those translated frames. Query indices run from 0 to (number of actual query sequences - 1)
frame Number describing which translation frame corresponds to this context. Protein sequences have frame = 0 for all contexts. Nucleotide sequences have frame = 1 or -1, corresponding to the plus and minus strands of each query sequence (the contexts occur in that order). Translated nucleotide sequences have frame = +-1, +-2, or +-3. As with the SequenceBlk, contexts for translated nucleotide queries have the plus strand frames in order and then the minus strand frames in order
is_valid TRUE if context participates in the search, FALSE if not

For protein queries, all contexts are always valid. For other types of queries, contexts within the context array may not be valid if the search is expected to ignore the plus strand or the minus strand of the query. By default, both strands are searched and all contexts are valid. We've gone back and forth a number of times on this issue, and the current thinking is that it is better to have invalid contexts than to remove them from the context array. There are many places in the core engine that perform arithmetic that assumes nucleotide queries have pairs of frames, and translated queries have frames in groups of six. The special treatment of the query_offset field is to make sure that binary searches through the context array can locate the correct context for a given query offset, even if the array contains invalid contexts. Note that if the plus or minus strand is ignored, BLAST_SequenceBlk does not contain the sequence data for that strand, but BlastQueryInfo does contain (invalid) BlastContextInfo structures for the missing strand.

For example, given two nucleotide queries to a blastn search, of size 50 and 100, if searching both strands then the SequenceBlk contains 4 sequences of length 50, 50, 100 and 100, separated by sentinel bytes.

context query_offset query_length query_index frame context is
0 0 50 0 1 valid
1 51 50 0 -1 valid
2 102 100 1 1 valid
3 203 100 1 -1 valid

while for the plus strand only, the SequenceBlk has two sequences of size 50 and 100, separated by a sentinel byte.

context query_offset query_length query_index frame context is
0 0 50 0 1 valid
1 51 0 0 -1 invalid
2 51 100 1 1 valid
3 152 0 1 -1 invalid

For the minus strand only, the SequenceBlk has the same size but contains reverse complement sequences. The QueryInfo looks like

context query_offset query_length query_index frame context is
0 0 0 0 1 invalid
1 0 50 0 -1 valid
2 51 0 1 1 invalid
3 51 100 1 -1 valid

Note that in order to locate the context where a particular hit occurs, the engine uses binary search on BlastQueriyInfo::query_offset (see BSearchContextInfo). This binary search finds the smallest query_offset that is greater than a given sequence offset, and picks the context one back from that. Even when two query_offset values are the same, this will locate the correct context.

Other fields of interest in BlastQueryInfo:

num_queries the number of query sequences before translation or creation of minus strands. May be less than the number of contexts
first_context the offset in the contexts[] array of the first valid context. This number may be nonzero
last_context the offset in the contexts[] array of the last valid context. This number may be less than 2*N-1 for N nucleotide queries, and may be less than 6*N-1 for N translated nucleotide queries
max_length the maximum number of letters in any context

Any loops over contexts should check is_valid field of each context in the list, and should iterate only from first_context to last_context inclusive.

Representing Database Sequences

Since at present the core engine deals with a single database sequence at a time, there is no need to set up a counterpart to BlastQueryInfo for every database sequence. The minus strand of database sequences is never searched by blastn; instead, hits to the minus strand of a query are turned into hits to the plus strand of the query but the minus strand of the database sequence. In the case of translated database sequences (i.e. tblastn), each database sequence is individually converted into its in-frame (and possibly out-of-frame) translation. Then s_BlastSearchEngineCore iterates through each of the frames, performing one pairwise search at a time between all the query sequences and that one frame.

Another exceptional case is RPS blast, which reverses the role of query sequence and database sequence. RPS blast concatenates the entire protein database into a single giant sequence, then switches the role of query and database before running the rest of the engine, which must then see a query sequence with a huge number of contexts (the whole database worth). Because of this, a partial BlastQueryInfo structure has to be constructed using s_RPSOffsetArrayToContextOffsets

Clone this wiki locally