Skip to content
Jouni Siren edited this page Jan 29, 2018 · 6 revisions

Both formats

  • All paths must have the same length. Paths ending at the sink node must be padded with $s.
  • Using excessively large node identifiers is probably a bad idea. The identifiers of different nodes do not have to form a contiguous range.
  • GCSA does not use node orientations, so the orientations can also be encoded in node identifiers. It is probably a bad idea to use the reverse complement orientation for the source node and the sink node.
  • Node offsets can be from 0 to 1023. The offsets must start from 0 and be contiguous for the same node of the original graph.
  • If node orientations are used, offsets 0 to k - 1 in the forward strand correspond to offsets k - 1 to 0 in the reverse complement strand, in that order.

Binary format (default)

General

  • The extension for a graph in binary format is .graph.
  • Every field is an unsigned 64-bit integer. The bulk of the file is in native GCSA input format.
  • Each input file consists of one or more sections. Each section starts with a header, followed by data. Multiple files with the same kmer length can be concatenated into a single file.

Header

  • One field for future use. In the current version, its value must be 0.
  • The number of paths in the section.
  • The length of the paths.

Data

  • The paths as KMer structures in an arbitrary order.
  • Each structure consists of three fields: key, from, and to.

key

  • Created by Key::encode(const Alphabet& alpha, const std::string& kmer, byte_type pred, byte_type succ).
  • alpha is an object that maps between character values and comp values (the internal alphabet of GCSA). The default constructor of Alphabet creates an object that maps $ACGTN# to 0-6.
  • kmer is the fixed-length label of the path.
  • pred encodes the labels of the nodes preceding the path. pred & (1 << comp) is true iff comp is a comp value for one of the labels.
  • succ encodes the set of character labels at the extension nodes of the path.

from

  • Encodes the start node of the path.
  • Created by Node::encode(size_type node_id, size_type node_offset) or Node::encode(size_type node_id, size_type node_offset, bool reverse_complement).
  • node_id is the identifier of the corresponding node in the original graph. The identifier uses the highest 53 bits.
  • reverse_complement tells the orientation of the node. Bit 10 is used to store the orientation, with forward orientation encoded as 0 and reverse complement encoded as 1.
  • node_offset is the start offset (in the label of the node of the original graph) for the path. We use the lower 10 bits.

to

  • Encodes an extension node of the path in the same way as from.
  • If the path has multiple extension nodes, separate KMer objects with different to fields must be created for each of them. The succ field of the key may encode all extension node labels or just the one corresponding to this object, whichever is more convenient.
  • If the path cannot be extended because it ends at the sink node (it ends with one or more $s), it must be marked with KMer::makeSorted(). This is automatically done when the file is loaded with InputGraph.

Text format (for debugging purposes)

General

  • The extension for a graph in text format is .gcsa2. This will probably change in the future.
  • Each path is encoded as one or more lines. Each line consists of five fields, separated by tabs: path label, start node, predecessor labels, extension node labels, and extension nodes.
  • Because parsing each line takes a few microseconds, the text format is only suitable for debugging purposes.
  • Multiple files can be concatenated to form a single file.

Label

  • The label is encoded as a byte string.

Nodes

  • The position corresponding to a node is encoded as id:offset (for forward orientation) or as id:-offset (for reverse complement orientation).
  • See the binary format for further information about the identifier, the offset, and the orientation.
  • If a path has multiple extension nodes, the nodes can be separated by commas or encoded on different lines.

Predecessor and extension labels

  • The labels are encoded as strings of comma-separated characters.
  • If a path is represented as multiple lines, each line can have all possible extension labels or only those corresponding to the extension nodes on that line, whichever is more convenient.

Node mapping

General

  • Every field is an unsigned 64-bit integer.
  • The mapping is a function f from node identifiers to node identifiers in the original graph such that:
    • f(x) = x for all x < first_node.
    • f(x) is defined separately for all x >= first_node.

Header

  • The identifier first_node of the first node that has a mapping.
  • The first unassigned node identifier.

Body

  • For each node identifier that has a mapping in sorted order: the identifier the node maps to.