-
Notifications
You must be signed in to change notification settings - Fork 11
Query Interface
Queries are basic operations working in a single thread. They can be called concurrently from multiple threads, because they are const
member functions. If something fails, a high-level query returns a neutral value. This can be value 0
, an empty range, or an empty vector. A failed low-level query may crash or return an unspecified value.
The methods for handling integer ranges are defined in utils.h.
typedef std::uint64_t size_type;
typedef std::pair<size_type, size_type> range_type;
Closed ranges are represented as pairs of integers. If rng
is a range, it denotes the integers from rng.first
to rng.second
(inclusive). If rng.first + 1 > rng.second + 1
, the range is empty. This means that range_type(0, -1)
is empty, even though the integers are unsigned.
struct Range
defines a number of static functions:
-
size length(range_type range)
: The length ofrange
. -
bool empty(range_type range)
: Isrange
empty? -
size_type bound(size_type value, range_type range)
: Adjustsvalue
up or down until it falls withinrange
. Ifrange
is empty, returnsrange.first
. -
size_type bound(size_type value, size_type low, size_type high)
: As above withrange_type(low, high)
asrange
. -
range_type empty_range()
: Returns the empty rangerange_type(1, 0)
.
The mapping between the nodes of the input graph and the positions in the original graph is defined in support.h. See Graph Formats for further details.
typedef std::uint64_t node_type;
struct Node
defines the mapping as the following static functions:
-
node_type encode(size_type node_id, size_type node_offset, bool reverse_complement)
: The node corresponding to offsetnode_offset
in the forward or reverse complement strand (according toreverse_complement
) of nodenode_id
of the original graph. -
node_type encode(size_type node_id, size_type node_offset)
: As above, but only for the forward strand. -
size_type id(node_type node)
: The identifier of the node of the original graph corresponding to nodenode
of the input graph. -
bool rc(node_type node)
: Does nodenode
of the input graph map to the reverse complement strand? -
size_type offset(node_type node)
: The offset in the node of the original graph corresponding to nodenode
of the input graph. -
std::string decode(node_type node)
: A human-readable representation of the position in the original graph corresponding to nodenode
of the input graph. -
node_type encode(const std::string& token)
: Parses the human-readable representationtoken
of a position in the original graph and returns the corresponding node of the input graph.
High-level queries work with characters and lexicographic ranges of path nodes. They are defined in gcsa.h.
template<class Iterator> range_type find(Iterator begin, Iterator end) const
template<class Container> range_type find(const Container& pattern) const
template<class Element> range_type find(const Element* pattern, size_type length) const
Returns the lexicographic range of path nodes matching the pattern. The pattern can be given as a pair of bidirectional iterators, as a container with such iterators, or as a pointer to an array of characters.
Example: std::string p = "GATTACA"; range_type range = index.find(p);
finds the range of paths matching pattern GATTACA
.
Example: const char* p = "CTAATGT"; range_type range = index.find(p, std::strlen(p));
finds the range of paths matching pattern CTAATGT
.
size_type count(range_type range) const
This query counts the number of distinct start nodes in the given lexicographic range of path nodes. The query only works on ranges corresponding to suffix tree nodes, such as ranges returned by find()
queries.
Example:
std::string p = "GATTACA"; range_type range = index.find(p);
size_type count = index.count(range);
This counts the number of distinct start nodes of all paths matching pattern GATTACA
.
void locate(size_type path, std::vector<node_type>& results, bool append = false, bool sort = true) const
void locate(range_type range, std::vector<node_type>& results, bool append = false, bool sort = true) const
void locate(range_type range, size_type max_positions, std::vector<node_type>& results) const
This query determines all start nodes in the given lexicographic range of path nodes.
-
path
: The lexicographic rank of a single path node. -
range
: A lexicographic range of path nodes. -
max_positions
: Return at most this many start nodes. If there are more, the located subset is chosen randomly. -
results
: The vector where the results should be written to. -
append
: Append the results to the existing contents of the vector instead of clearing it first. -
sort
: Sort the start nodes and remove duplicates.
Example:
std::string p = "GATTACA"; range_type range = index.find(p);
std::vector<node_type> results; index.locate(range, results);
This finds the set of distinct start nodes of all paths matching pattern GATTACA
.
Low-level queries work with comp values and lexicographic ranges of path nodes. They are defined in gcsa.h. These queries generally do not make any sanity checks for the parameter values.
-
size_type size() const
: Returns the number of path nodes in the path graph. -
size_type edgeCount() const
: Returns the number of edges in the path graph. -
size_type order() const
: Returns the order of the path graph. Queries longer than the order may produce false positives (but no false negatives). -
size_type sampleCount() const
: Returns the number of sampled nodes in the index. -
size_type sampleBits() const
: Returns the size of a sample in bits. -
size_type sampledPositions() const
: Returns the number of path nodes containing samples.
-
range_type charRange(comp_type comp) const
: Returns the lexicographic range of paths starting with comp valuecomp
. -
range_type LF(range_type range, comp_type comp) const
: Returns the lexicographic range of predecessors of the lexicographic rangerange
of paths starting with comp valuecomp
. -
size_type LF(size_type path_node) const
: Returns the lexicographic range of a predecessor of the path node with lexicographic rankpath_node
. This should only be used with path nodes with one predecessor (e.g. whensampled(path_node)
is false). -
void LF_fast(range_type range, std::vector<range_type>& results) const
: Finds the lexicographic ranges of predecessors of the lexicographic rangerange
of paths asresults[comp] = LF(range, comp)
for comp valuescomp
encoded using fast bitvectors. By default this means comp values 1 to 4 corresponding to the basesACGT
. -
void LF_all(range_type range, std::vector<range_type>& results) const
: AsLF_fast
, but also include non-technical characters using sparse bitvectors. By default this adds comp value 5 corresponding toN
.
-
bool sampled(size_type path_node) const
: Tells whether the path node with lexicographic rankpath_node
contains samples. -
range_type sampleRange(size_type path_node) const
: Returns the range of sampled nodes for the path node with lexicographic rankpath_node
, assuming that it contains samples. -
inline size_type firstSample(size_type path_node) const
: Returns the position of the first sampled node for the path node with lexicographic rankpath_node
, assuming that it contains samples. -
inline bool lastSample(size_type i) const
: Tells whether the sampled node at positioni
is the last sampled node for a path node. -
node_type sample(size_type i) const
: Returns thei
th sampled node.
The following low-level queries are members of the LCPArray
class, which is defined in lcp.h.
-
size_type size() const
: Returns the number of path nodes in the path graph. -
size_type values() const
: Return the total number of LCP values in the LCP array and the range minimum tree. -
size_type levels()
: Returns the number of levels in the range minimum tree. -
size_type branching()
: Returns the branching factor of the range minimum tree.
-
size_type operator[] (size_type i) const
: Returns thei
th LCP value, which tells the length of the longest common prefix of path nodesi - 1
andi
in lexicographic order. -
STNode root() const
: Returns the root of the suffix tree.
STNode parent(const STNode& node) const
STNode parent(range_type range) const
Returns the parent of the suffix tree node node
or the node corresponding to lexicographic range range
. Stores the string depth of the parent in the returned node.
size_type depth(const STNode& node) const
size_type depth(STNode& node) const
size_type depth(range_type range)
Returns the string depth of the suffix tree node node
or the node corresponding to lexicographic range range
. The string depth of a node is the length of the longest pattern matching the corresponding lexicographic range. If the depth is already stored in node
, the query returns the stored depth. If node
is not const
, the string depth is stored in it.
String depth is not well-defined for leaf nodes, and the query returns STNode::UNKNOWN
. If node
is a leaf node, its string depth can be anything from depth(parent(node)) + 1
to the order of the pruned de Bruijn graph.
Suffix tree nodes correspond to lexicographic ranges of path nodes. They are represented by STNode
objects, which support the following queries.
-
range_type range() const
: Returns the lexicographic range of path nodes corresponding to the node. -
size_type lcp() const
: Returns the LCP value or the string depth of the node. If the string depth has not been stored in the node, the return value isSTNode::UNKNOWN
. See alsoLCPArray::depth()
.
There is also an alias LCPArray::node_type
for STNode
. While it can be confused with gcsa::node_type
, it may be useful for future compatibility with SDSL compressed suffix trees.
The following operators can be used to compare suffix tree nodes to other STNode
objects and lexicographic ranges of path nodes.
bool operator== (const STNode& node) const
bool operator== (range_type range) const
bool operator!= (const STNode& node) const
bool operator!= (range_type range) const
The following is adapted from verifyIndex()
in algorithms.cpp. It determines whether the parent()
operation works correctly with the range matching query string kmer
.
range_type range = index.find(kmer);
if(!Range::empty(range))
{
STNode parent_result = lcp.parent(range);
range_type find_result = range;
std::string::iterator begin = kmer.begin(), end = kmer.end();
while(find_result == range)
{
--end;
find_result = index.find(begin, end);
}
if(parent_result != find_result || parent_result.lcp() != (size_type)(end - begin))
{
std::cerr << "parent" << range << " returned " << parent_result
<< ", expected " << find_result << " at depth " << (end - begin) << std::endl;
}
}