Skip to content
Jouni Siren edited this page Mar 14, 2018 · 21 revisions

General

Queries are basic operations working in a single thread. They can be called concurrently from multiple threads, because they are const member functions. If something fails, a high-level query returns a neutral value. This can be value 0, an empty range, or an empty vector. A failed low-level query may crash or return an unspecified value.

Utility methods

Ranges

The methods for handling integer ranges are defined in utils.h.

typedef std::uint64_t size_type;
typedef std::pair<size_type, size_type> range_type;

Closed ranges are represented as pairs of integers. If rng is a range, it denotes the integers from rng.first to rng.second (inclusive). If rng.first + 1 > rng.second + 1, the range is empty. This means that range_type(0, -1) is empty, even though the integers are unsigned.

struct Range defines a number of static functions:

  • size length(range_type range): The length of range.
  • bool empty(range_type range): Is range empty?
  • size_type bound(size_type value, range_type range): Adjusts value up or down until it falls within range. If range is empty, returns range.first.
  • size_type bound(size_type value, size_type low, size_type high): As above with range_type(low, high) as range.
  • range_type empty_range(): Returns the empty range range_type(1, 0).

Nodes and positions

The mapping between the nodes of the input graph and the positions in the original graph is defined in support.h. See Graph Formats for further details.

typedef std::uint64_t node_type;

struct Node defines the mapping as the following static functions:

  • node_type encode(size_type node_id, size_type node_offset, bool reverse_complement): The node corresponding to offset node_offset in the forward or reverse complement strand (according to reverse_complement) of node node_id of the original graph.
  • node_type encode(size_type node_id, size_type node_offset): As above, but only for the forward strand.
  • size_type id(node_type node): The identifier of the node of the original graph corresponding to node node of the input graph.
  • bool rc(node_type node): Does node node of the input graph map to the reverse complement strand?
  • size_type offset(node_type node): The offset in the node of the original graph corresponding to node node of the input graph.
  • std::string decode(node_type node): A human-readable representation of the position in the original graph corresponding to node node of the input graph.
  • node_type encode(const std::string& token): Parses the human-readable representation token of a position in the original graph and returns the corresponding node of the input graph.

High-level queries

High-level queries work with characters and lexicographic ranges of path nodes. They are defined in gcsa.h.

find(P)

template<class Iterator> range_type find(Iterator begin, Iterator end) const
template<class Container> range_type find(const Container& pattern) const
template<class Element> range_type find(const Element* pattern, size_type length) const

Returns the lexicographic range of path nodes matching the pattern. The pattern can be given as a pair of bidirectional iterators, as a container with such iterators, or as a pointer to an array of characters.

Example: std::string p = "GATTACA"; range_type range = index.find(p); finds the range of paths matching pattern GATTACA.

Example: const char* p = "CTAATGT"; range_type range = index.find(p, std::strlen(p)); finds the range of paths matching pattern CTAATGT.

count(sp, ep)

size_type count(range_type range) const

This query counts the number of distinct start nodes in the given lexicographic range of path nodes. The query only works on ranges corresponding to suffix tree nodes, such as ranges returned by find() queries.

Example:

std::string p = "GATTACA"; range_type range = index.find(p);
size_type count = index.count(range);

This counts the number of distinct start nodes of all paths matching pattern GATTACA.

locate(sp, ep)

void locate(size_type path, std::vector<node_type>& results, bool append = false, bool sort = true) const
void locate(range_type range, std::vector<node_type>& results, bool append = false, bool sort = true) const
void locate(range_type range, size_type max_positions, std::vector<node_type>& results) const

This query determines all start nodes in the given lexicographic range of path nodes.

  • path: The lexicographic rank of a single path node.
  • range: A lexicographic range of path nodes.
  • max_positions: Return at most this many start nodes. If there are more, the located subset is chosen randomly.
  • results: The vector where the results should be written to.
  • append: Append the results to the existing contents of the vector instead of clearing it first.
  • sort: Sort the start nodes and remove duplicates.

Example:

std::string p = "GATTACA"; range_type range = index.find(p);
std::vector<node_type> results; index.locate(range, results);

This finds the set of distinct start nodes of all paths matching pattern GATTACA.

Low-level queries

Low-level queries work with comp values and lexicographic ranges of path nodes. They are defined in gcsa.h. These queries generally do not make any sanity checks for the parameter values.

Statistics

  • size_type size() const: Returns the number of path nodes in the path graph.
  • size_type edgeCount() const: Returns the number of edges in the path graph.
  • size_type order() const: Returns the order of the path graph. Queries longer than the order may produce false positives (but no false negatives).
  • size_type sampleCount() const: Returns the number of sampled nodes in the index.
  • size_type sampleBits() const: Returns the size of a sample in bits.
  • size_type sampledPositions() const: Returns the number of path nodes containing samples.

Searching

  • range_type charRange(comp_type comp) const: Returns the lexicographic range of paths starting with comp value comp.
  • range_type LF(range_type range, comp_type comp) const: Returns the lexicographic range of predecessors of the lexicographic range range of paths starting with comp value comp.
  • size_type LF(size_type path_node) const: Returns the lexicographic range of a predecessor of the path node with lexicographic rank path_node. This should only be used with path nodes with one predecessor (e.g. when sampled(path_node) is false).
  • void LF_fast(range_type range, std::vector<range_type>& results) const: Finds the lexicographic ranges of predecessors of the lexicographic range range of paths as results[comp] = LF(range, comp) for comp values comp encoded using fast bitvectors. By default this means comp values 1 to 4 corresponding to the bases ACGT.
  • void LF_all(range_type range, std::vector<range_type>& results) const: As LF_fast, but also include non-technical characters using sparse bitvectors. By default this adds comp value 5 corresponding to N.

Samples

  • bool sampled(size_type path_node) const: Tells whether the path node with lexicographic rank path_node contains samples.
  • range_type sampleRange(size_type path_node) const: Returns the range of sampled nodes for the path node with lexicographic rank path_node, assuming that it contains samples.
  • inline size_type firstSample(size_type path_node) const: Returns the position of the first sampled node for the path node with lexicographic rank path_node, assuming that it contains samples.
  • inline bool lastSample(size_type i) const: Tells whether the sampled node at position i is the last sampled node for a path node.
  • node_type sample(size_type i) const: Returns the ith sampled node.

LCP array

The following low-level queries are members of the LCPArray class, which is defined in lcp.h.

Statistics

  • size_type size() const: Returns the number of path nodes in the path graph.
  • size_type values() const: Return the total number of LCP values in the LCP array and the range minimum tree.
  • size_type levels(): Returns the number of levels in the range minimum tree.
  • size_type branching(): Returns the branching factor of the range minimum tree.

General operations

  • size_type operator[] (size_type i) const: Returns the ith LCP value, which tells the length of the longest common prefix of path nodes i - 1 and i in lexicographic order.
  • STNode root() const: Returns the root of the suffix tree.

parent(v)

STNode parent(const STNode& node) const
STNode parent(range_type range) const

Returns the parent of the suffix tree node node or the node corresponding to lexicographic range range. Stores the string depth of the parent in the returned node.

depth(v)

size_type depth(const STNode& node) const
size_type depth(STNode& node) const
size_type depth(range_type range)

Returns the string depth of the suffix tree node node or the node corresponding to lexicographic range range. The string depth of a node is the length of the longest pattern matching the corresponding lexicographic range. If the depth is already stored in node, the query returns the stored depth. If node is not const, the string depth is stored in it.

String depth is not well-defined for leaf nodes, and the query returns STNode::UNKNOWN. If node is a leaf node, its string depth can be anything from depth(parent(node)) + 1 to the order of the pruned de Bruijn graph.

Suffix tree nodes

Suffix tree nodes correspond to lexicographic ranges of path nodes. They are represented by STNode objects, which support the following queries.

  • range_type range() const: Returns the lexicographic range of path nodes corresponding to the node.
  • size_type lcp() const: Returns the LCP value or the string depth of the node. If the string depth has not been stored in the node, the return value is STNode::UNKNOWN. See also LCPArray::depth().

There is also an alias LCPArray::node_type for STNode. While it can be confused with gcsa::node_type, it may be useful for future compatibility with SDSL compressed suffix trees.

The following operators can be used to compare suffix tree nodes to other STNode objects and lexicographic ranges of path nodes.

  • bool operator== (const STNode& node) const
  • bool operator== (range_type range) const
  • bool operator!= (const STNode& node) const
  • bool operator!= (range_type range) const

Example

The following is adapted from verifyIndex() in algorithms.cpp. It determines whether the parent() operation works correctly with the range matching query string kmer.

  range_type range = index.find(kmer);
  if(!Range::empty(range))
  {
    STNode parent_result = lcp.parent(range);
    range_type find_result = range;
    std::string::iterator begin = kmer.begin(), end = kmer.end();
    while(find_result == range)
    {
      --end;
      find_result = index.find(begin, end);
    }
    if(parent_result != find_result || parent_result.lcp() != (size_type)(end - begin))
    {
      std::cerr << "parent" << range << " returned " << parent_result
                << ", expected " << find_result << " at depth " << (end - begin) << std::endl;
    }
  }