gretl - Graph evaluation toolkit

gretl is a tool for basic graph statistics using GFA format input. Our statistics are based on nodes, edges and paths/walks. Walks can also be used, but will be represented as paths internally. Many commands do not work without paths/walk information.
In addition, we added some commands, using graph-based statistics, but represent it on a path (sample scale). We also offer commands for more complex analysis - an overview can be seen below.

Requirements on GFA file:

GFA format v1.0, v1.1 or v1.2.
Numerical node ID

Comment:

Sorted node IDs (in 1D SGD) are not required, but all "jump" related statistics, the block and nwindow subcommand will be based on the order of the nodes in the GFA file. Check this paper for more information.
- Run odgi sort -Y to sort the graph in pan-genomic order.
We recommend dense node ID, starting at 1 and end at the number of nodes +1. Memory efficient on multiple levels.

Prepare your graph (optional)

This is normally not needed if your graph is constructed by PGGB, Minigraph-Cactus or vg construct

# Convert string-based node IDs to numerical IDs
./target/release/gretl id2int -g /path/to/old_grapg.gfa -o /path/to/new_graph.gfa

# Sort the graph in pan-genomic order
odgi sort -Y -i /path/to/new_graph.gfa -o /path/to/new_graph_sorted.og
# Convert the ODGI file back to GFA
odgi view -i /path/to/new_graph_sorted.og -g > /path/to/new_graph_sorted.gfa

Plotting your results / Follow-along documentation

We have added python scripts to visualize the output of the different commands of this tool. The scripts are stored in the scripts directory. Almost all commands of gretl are covered, some even multiple times. For better understanding we added a follow-along markdown with examples, which highlights research questions and ideas can be answered.

Explanation

Since some of the commands run relative complex computation, we have added a detailed description of each subcommand in the doc directory, which can be found here. It covers how internal each command is structured, what is calculated and how this calculation is performed (very roughly).

Installation:

Github

git clone https://github.com/MoinSebi/gretl  
cd gretl   
cargo build --release

Running gretl:
A list of all subcommands is at the bottom

./target/release/gretl [subcommand] [options]

From Bioconda channel

Gretl is available on Bioconda. You need to install conda/mamba first. We recommend creating a new environment for gretl to avoid conflicts with other packages conda create -n gretl_env && conda activate gretl_env.

mamba install -c conda-forge -c bioconda gretl

Testing

We provide a small test suite to test the basic functionality of the tool.

cargo test

Performance

We tried to optimize the code as much as possible. Most steps are dependent on the number of nodes and the length of the paths. Runtime should be linear in most cases. All commands can be multithreaded, which can be controlled by the -t option.

Usage

Stats

Calculate statistics of a graph in GFA format. A list of all stats can be found here. Please consider using the --pansn option to group the paths by sample. Read more information about PanSN-spec here.

In general the stats command has two possible options:

Path-based statistics: Calculate statistics for each path in the graph (-p flag needed)
Graph-based statistics: Calculate statistics for the whole graph, including hybrid statistics based on path statistics (default). These hybrid statistics are the average and standard deviation of all path statistics. A full list of all statistics be found in paper directory in this repository.

Example

/// Graph-based + hybrid statistics
./gretl stats -g /path/to/graph.gfa -o /path/to/output.txt

/// Path-based statistics
./gretl stats -g /path/to/graph.gfa -o /path/to/output.path.txt -p

Result

TSV or YAML file with statistics

id2int

Convert any string-based node identifier to numeric values. Afterward, use odgi sort -Y to sort the graph in pan-genomic order, which will create more meaningful statistics in gretl stats and is required in ````block´´´ subcommand (see above). Nevertheless, numerical node IDs are required by any gretl command. The new node identifiers will start at 1 and end at the number of nodes +1. More information can be found here.

Available options:

-d, --dict <dict> Write a dictionary with the new and old IDs to a plain text file.

Example usage

./gretl id2int -g /path/to/graph.gfa -o /path/to/output.gfa -d /path/to/dict.txt

Result:

GFA file with numerical node IDs

Important: This function will convert all node IDs in the graph. Additional data in tags will not be converted.

Node-list

Individual node statistics based on graph-related features. Statistics provided:

Length
Node Degree
Depth
Core

Comment
The information of the reported table can be used as a individual lookup or to create own window-like statistics (over nodes).

Example usage

./gretl node-list -g /path/to/graph.gfa -o /path/to/output.txt

Result

TSV file output

Node_id	Length	Core	Depth	ND
1	21176	1	1	0
2	15530	1	1	0
3	15530	1	1	0
4	24351	1	1	0
5	24367	1	1	0
6	100	2	2	2
7	1	1	1	1
8	1	1	1	1

Core

Count the amount of sequence and the number of nodes found in each similarity level. The method calculates the similarity for each node and summarize for each possible level the number of nodes and the amount sequence.

Example usage

./gretl core -g /path/to/graph.gfa -o /path/to/output.txt

Path similarity (PS)

Calculate for each path the number of nodes and the amount of sequence at each similarity level.

Example usage

./gretl ps -g /path/to/graph.gfa -o /path/to/output.txt

Example output Core sequence/nodes are those which are traversed by all other samples as well (here: 5). Since nodes can be traversed multiple times, the amount of sequence and the number of nodes can be different in each path.
Type: N = Number of nodes, S = Amount of sequence [bp]

Accession	Type	1	2	3	4	5
ABQ_6.ChrX	N	336	891	1785	6630	7655
BIH_4.ChrX	N	278	764	1753	6592	7655
ABF_6.ChrX	N	2181	1093	1206	5216	7655
BPN_2.ChrX	N	1250	690	916	5170	7672
BCK_8.ChrX	N	3275	948	1062	3734	7655
ABQ_6.ChrX	S	47050	3792	12833	73540	597805
BIH_4.ChrX	S	26389	2445	12195	73030	597805
ABF_6.ChrX	S	33120	5969	12142	71593	597805
BPN_2.ChrX	S	104353	5352	1650	9894	598957
BCK_8.ChrX	S	53334	4054	2600	67558	597805

Bootstrap

We bootstrap different samples of the graph. We iterate from one sample to the maximum number of samples, each time bootstrapping n different combinations. At each iteration, we calculate the number of nodes and the amount of sequence in each similarity level. If you want to exclude speicifc nodes, you can use the --nodes option. The most important parameter is the number of bootstraps, which is defined by the --number option.

Available options:

--nodes <nodes>Run bootstrap only on these nodes
--number <number> Number of bootstrap for each number of genomes

Example usage

./gretl bootstrap -g /path/to/graph.gfa -o /path/to/output.txt -n 20

Example output

Size: Number of samples bootstrapped
Run: Which iteration of this bootstrap-level
Type: N = Number of nodes, S = Amount of sequence
1-5: Number of nodes or amount of sequence in each similarity level (depends on the type)

Size	Run	Type	1	2	3	4	5
2	0	N	12238	10890	0	0	0
2	1	N	7650	13368	0	0	0
2	2	N	10165	11903	0	0	0
3	0	N	7773	7263	9587	0	0
3	1	N	8621	7833	8466	0	0
3	2	N	4370	4041	13077	0	0
4	0	N	6756	2420	6487	9325	0
4	1	N	6227	2303	6249	9275	0
4	2	N	7870	3085	7085	7858	0
5	0	N	7315	2191	2240	6833	7655

Window (sliding, path)

Calculate statistics on a node level (graph- or path-based) and summarize them for each path in a sliding window approach. In detail: Iterate over the nodes of a path (window-like), summarize the stats of all nodes in the window and report a single value for each window.

Example Usage

./gretl window -g /path/to/graph.gfa -o /path/to/output.txt -s 1000 --step 100

Example output Table: Path in col1, similarity values on all the other values (each column is 1000 bp, going 100 bp steps)

ABQ_6.ChrX	5	5	5	5	5	5
BIH_4.ChrX	5	3.5	5	5	5	5
ABF_6.ChrX	5	5	5	5	5	5
BPN_2.ChrX	5	5	5	5	5	5
BCK_8.ChrX	5	5	5	5	5	4.5

Nwindow

For each node we calculate how many other nodes can be reached by a certain number of steps, sequence or jumps. The output is a table with the node ID, the number of nodes, the amount of sequence and the sum of jumps.

Step criteria

Jumps: A jumps is defined as difference between the current and the next node. Your input referees to the sum of all jumps in the window.
Steps: A step it the number of moves we make in the graph. Your input is the maximum steps from the starting node.
Sequence: Limit the window by a sequence threshold. We stop if the sequence length is larger than the provided threshold.

Example usage: How many nodes do I need to collect 1000 bp?

./gretl nwindow -g /path/to/graph.gfa -o /path/to/output.txt --sequence 1000 --node-number

Output: You are able to return the number of collected nodes, the total number of jumps or the total sequence. Some combinations of input limitation and output do not gain any additional information.

Table: NodesID, Number of nodes, amount of sequence and sum of jumps (collected in a window)

nodeid	node	sequence	jumps
240	54	50623	1321
241	53	50589	1296
242	46	1862	696
243	46	1862	709
244	44	1832	637
245	38	1762	458
246	38	1762	463
247	37	1567	410
248	33	575	280
249	33	575	280
250	33	391	256

Block

Statistics of pangenome blocks. Blocks are defined to be a collection of nodes which are in proximity. Proximity is measured by the node id differences, therefore the order of the nodes in the GFA file is important. Blocks normally contain nodes of consecutive node IDs and can be defined by either number of nodes of each block OR total amount of sequence in each block.

Important: This is the only command which requires a sorted node ID space (1D SGD), which can be achieved by odgi sort -Y.

Example usage

./gretl block -g /path/to/graph.gfa -o /path/to/output.txt

Example result

start_end	#Traversal	#Path	#Nodes (sum)	#Nodes (average)	Sequence [bp] (average)
1-101	5	5	5	25.6	12201.2
101-201 2	2	2	66.5	1748.5	1748.5
201-301 4	4	4	57.75	884.0	884.0
301-401 4	4	4	65.25	472.75	472.75
401-501 4	4	4	66.0	418.0	418.0
501-601 4	4	4	52.25	1823.25	1823.25
601-701 3	3	3	69.33333333333333	1646.3333333333333	1646.3333333333333

Find

Find a specific node (e.g. 10), directed node (e.g. 10+), or edge (e.g. 10+20+) in the graph and get the exact (sequence) position in the paths. Output is a BED file with the positions. You are able to add additional sequence (using flag -l) on both sites, which can help if you want to realign to a database and external sequence.

./gretl find -g /path/to/graph.gfa -o /path/to/output.txt --length 1000 -f feature.txt

Example input

10
11
12

Example output BED format. Tags: ID: Node ID, NS: Node start, NB: Node end

BPN_2.ChrX	0	21377	ID:1+;NS:21176;NB:0
BCK_8.ChrX	0	15731	ID:3+;NS:15530;NB:0

Citation

If you use gretl in your work, please cite:
Gretl - Variation GRaph Evaluation TooLkit
Sebastian Vorbrugg, Ilja Bezrukov, Zhigui Bao, Detlef Weigel
doi: https://doi.org/10.1101/2024.03.04.580974

Name		Name	Last commit message	Last commit date
Latest commit History 243 Commits
data		data
doc		doc
paper		paper
plots		plots
scripts		scripts
src		src
tests		tests
.gitignore		.gitignore
.travis.yml		.travis.yml
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

gretl - Graph evaluation toolkit

Requirements on GFA file:

Prepare your graph (optional)

Plotting your results / Follow-along documentation

Explanation

Installation:

Github

From Bioconda channel

Testing

Performance

Usage

Stats

id2int

Node-list

Core

Path similarity (PS)

Bootstrap

Window (sliding, path)

Nwindow

Block

Find

Citation

About

Releases 1

Packages

Contributors 2

Languages

License

MoinSebi/gretl

Folders and files

Latest commit

History

Repository files navigation

gretl - Graph evaluation toolkit

Requirements on GFA file:

Prepare your graph (optional)

Plotting your results / Follow-along documentation

Explanation

Installation:

Github

From Bioconda channel

Testing

Performance

Usage

Stats

id2int

Node-list

Core

Path similarity (PS)

Bootstrap

Window (sliding, path)

Nwindow

Block

Find

Citation

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 2

Languages

Packages