-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
compute parsimony branch lengths #18
Comments
Hello Will, I am a little unclear about the context of your question. LibPLL is a low level library, so optimizations like computing branch lengths are done at a higher level. As such this is a bit outside the scope of LibPLL, but fortunately this functionality exists in raxml-ng, if you have the tree and alignment as files. Just pass the tree in using the If you have any more questions, please ask. |
Hello (Tomas?),
Thanks for your rapid response. The context is parsimony trees based on rather large SARS-CoV-2 alignments (40,000 sequences x 30,000 columns); we’re using Goloboff’s "oblong" for parsimony search using sequence sets reduced to parsimony-informative sites and then made unique . The resulting tree gets expanded to include the full sequences, and at that point I need to recompute branch lengths — and I’d thought to use functions in libpll to do that.
If, as you suggest, raxml-ng can compute parsimony branch lengths for these rather large trees, all will be well. Thanks for the suggestion, I’ll try to make it work.
Best regards,
— Will Fischer
…______________________
Will Fischer, PhD
net: wfischer@lanl.gov<mailto:wfischer@lanl.gov>
cell: 505-577-0592
Group T-6, Theoretical Biology
Los Alamos National Laboratory
On Aug 11, 2020, at 12:43 PM, computations <notifications@github.com<mailto:notifications@github.com>> wrote:
Hello Will,
I am a little unclear about the context of your question. LibPLL is a low level library, so optimizations like computing branch lengths are done at a higher level. As such this is a bit outside the scope of LibPLL, but fortunately this functionality exists in raxml-ng<https://github.com/amkozlov/raxml-ng>, if you have the tree and alignment as files. Just pass the tree in using the --tree switch, and add the --evaluate argument as well.
If you have any more questions, please ask.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub<#18 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AB2HLA37NHVGTKTEK7MS25TSAGGNJANCNFSM4P3LDFIA>.
|
Hello again,
I had looked at this before — can raxml-ng compute parsimony branch lengths using the --tree and --evaluate flags? I was unable to get that to work.
Best regards,
— Will
On Aug 11, 2020, at 1:04 PM, Fischer, William Mclean <wfischer@lanl.gov<mailto:wfischer@lanl.gov>> wrote:
Hello (Tomas?),
Thanks for your rapid response. The context is parsimony trees based on rather large SARS-CoV-2 alignments (40,000 sequences x 30,000 columns); we’re using Goloboff’s "oblong" for parsimony search using sequence sets reduced to parsimony-informative sites and then made unique . The resulting tree gets expanded to include the full sequences, and at that point I need to recompute branch lengths — and I’d thought to use functions in libpll to do that.
If, as you suggest, raxml-ng can compute parsimony branch lengths for these rather large trees, all will be well. Thanks for the suggestion, I’ll try to make it work.
Best regards,
— Will Fischer
…______________________
Will Fischer, PhD
net: wfischer@lanl.gov<mailto:wfischer@lanl.gov>
cell: 505-577-0592
Group T-6, Theoretical Biology
Los Alamos National Laboratory
On Aug 11, 2020, at 12:43 PM, computations <notifications@github.com<mailto:notifications@github.com>> wrote:
Hello Will,
I am a little unclear about the context of your question. LibPLL is a low level library, so optimizations like computing branch lengths are done at a higher level. As such this is a bit outside the scope of LibPLL, but fortunately this functionality exists in raxml-ng<https://github.com/amkozlov/raxml-ng>, if you have the tree and alignment as files. Just pass the tree in using the --tree switch, and add the --evaluate argument as well.
If you have any more questions, please ask.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub<#18 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AB2HLA37NHVGTKTEK7MS25TSAGGNJANCNFSM4P3LDFIA>.
|
Hello @willfischer , I'm not sure what you ask is implemented in raxml-ng. As Ben pointed out, the In general, we have functions in libpll that compute the parsimony score of a tree and such functions are probably used in raxml-ng for quickly creating starting trees using methods like randomized stepwise addition. Parsimony functions implemented in libpll follow the minimum evolution principle as an approximation to ML, and what the functions give is the minimum number of mutations required for the evolutionary process to give the sequences. If I understand your issue correctly, replacing the parsimony informative columns with all columns in the original alignment won't change the score. Best, |
Hello Will, Sorry, I am Ben. I need to be really clear here that raxml-ng will compute branch lengths under a specified substitution model, and so cannot really be considered "parsimony branch lengths". To be honest, I am not entirely sure what you mean by this. I personally have never heard of researchers computing branch lengths under a parsimony assumption, and most of the literature that I can find on this topic suggests that distances are computed under some Markov substitution process, such as JC. If you want the minimum number of substitutions required the JC model should suit your needs for short distances. At this point, I feel obligated to inform you that there is a good chance that even on the SARS-CoV-2 data of underestimated branch lengths when using parsimony, even though the distances between most of the sequences is small. Furthermore, in our own investigations on the SARS-CoV-2 data that the topologies produced by phylogenetic methods are highly suspect. In particular, there are likely to be many trees which either maximize the likelihood. I suspect that in this case, since the distances between the sequences are so small, the most parsimonious tree is similar to the maximum likelihood tree. This is to say, the parsimony surface is likely to exhibit many maxima, and it will be difficult to distinguish between different topologies from the signal present in the sequences alone. Now that I have said that, if you really want the minimum number of substitutions required for a given tree, there is an existing bit of code that I can adapt to do this if you need it. (This was written before @xflouris wrote his example, but the offer still stand if you need it). Best, |
Hi Ben,
Thanks for this thoughtful response. I agree that there are significant difficulties with computing phylogenies in the highly sampled, low mutation rate SARS-CoV-2 datasets. In this context, it is extremely difficult, as you likely recognize, to distinguish between repeated de novo mutations and shared derived mutations, particularly when recombination occurs (as we know it does).
In actual fact, we are not trying to estimate phylogenies that are absolutely correct, but simply to cluster sequences by similarity to make analysis more convenient. It is unquestionable that there will be many most-parsimonious trees, as there will be very many nearly-most-likely trees (the likelihood and parsimony surfaces are both very nearly flat). For all its failings, the parsimony criterion is a simple set of assumptions, not vulnerable to overfitting, and with a substantial history (in the older literature). I’ve been using PAUP* for these purposes, but have run into a hard-coded limit (2**14 taxa) which is impractical to re-code.
In the end, yes, I really really do want (being very specific) to generate, for a given topology, branch lengths defined by the minimum number of substitutions. If you have code that could be trivially adapted to do this, I would be most grateful.
Very best regards,
— Will
On Aug 11, 2020, at 2:02 PM, computations <notifications@github.com<mailto:notifications@github.com>> wrote:
Hello Will,
Sorry, I am Ben. I need to be really clear here that raxml-ng will compute branch lengths under a specified substitution model, and so cannot really be considered "parsimony branch lengths". To be honest, I am not entirely sure what you mean by this. I personally have never heard of researchers computing branch lengths under a parsimony assumption, and most of the literature that I can find on this topic suggests that distances are computed under some Markov substitution process, such as JC. If you want the minimum number of substitutions required the JC model should suit your needs for short distances.
At this point, I feel obligated to inform you that there is a good chance that even on the SARS-CoV-2 data of underestimated branch lengths when using parsimony, even though the distances between most of the sequences is small. Furthermore, in our own investigations<https://www.biorxiv.org/content/10.1101/2020.08.05.239046v1> on the SARS-CoV-2 data that the topologies produced by phylogenetic methods are highly suspect. In particular, there are likely to be many trees which either maximize the likelihood. I suspect that in this case, since the distances between the sequences are so small, the most parsimonious tree is similar to the maximum likelihood tree. This is to say, the parsimony surface is likely to exhibit many maxima, and it will be difficult to distinguish between different topologies from the signal present in the sequences alone.
Now that I have said that, if you really want the minimum number of substitutions required for a given tree, there is an existing bit of code that I can adapt to do this if you need it.
Best,
Ben
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#18 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AB2HLA35B7HOHOR44KZIHJ3SAGPUHANCNFSM4P3LDFIA>.
|
Hello Tomas,
Thanks for this response — it is a pretty fair statement of my desires. As you say, the internal branches won’t change for the full dataset, and the lengths of each terminal branch need to be incremented to account for the non-informative mutations in each sequence. My problem is that I do not in fact have the branch lengths, and would like to compute them for the given topology.
I think I can put together some code using "set_missing_branch_length" (from examples/newick-fasta-rooted/newick-fasta-rooted.c in libpll-2), but (1) I’m not terribly adept with C, and (2) it isn’t clear how to compute the branch lengths — which was really my question. Can you give me any guidance on this approach?
On the other hand, Ben suggested he has code he can adapt for this, and I’d be thrilled if that works — so I can stop flailing and get back to data analysis ...
Thanks for your thoughts,
— Will
…______________________
Will Fischer, PhD
net: wfischer@lanl.gov<mailto:wfischer@lanl.gov>
cell: 505-577-0592
Group T-6, Theoretical Biology
Los Alamos National Laboratory
On Aug 11, 2020, at 2:00 PM, Tomas Flouri <notifications@github.com<mailto:notifications@github.com>> wrote:
Hello @willfischer<https://github.com/willfischer> ,
I'm not sure what you ask is implemented in raxml-ng. As Ben pointed out, the --evaluate can be used in raxml-ng to evaluate the tree, but I think it's only available for ML.
In general, we have functions in libpll that compute the parsimony score of a tree and such functions are probably used in raxml-ng for quickly creating starting trees using methods like randomized stepwise addition.
Parsimony functions implemented in libpll follow the minimum evolution principle as an approximation to ML, and what the functions give is the minimum number of mutations required for the evolutionary process to give the sequences. If I understand your issue correctly, replacing the parsimony informative columns with all columns in the original alignment won't change the score.
I suspect what you are interested in is some kind of expected number of mutations per site as branch lengths, which you probably already have for the reduced alignment, computed as y=branch_mutations / sites. Adding x constant sites makes sense and you can easily obtain the new branches as y'=y * sites/x since no mutations are introduced. For other uninformative sites e.g. aaaabcd choose the character appearing most times ('a' in the example) and add one mutation to each terminal branch corresponding to the other characters (b c and d). Finally, compute the expected mutations in a similar way e.g. y'=(y*sites + new_mutations)/x where x is the total number of sites.
Does it make sense what I'm writing?
Best,
Tomas
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#18 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AB2HLAZTIZ5GMA7PLJBGLCDSAGPPFANCNFSM4P3LDFIA>.
|
In regards to the choice of parsimony, I think it is a fine choice for this application (I did some snooping and found what I strongly suspect is your preprint for this paper), but I just wanted to be sure you were aware of the problems. I wish you luck in your endeavors! Here is a quick little program that was adapted from @xflouris examples. I didn't work to hard on it, so I don't offer any strong guarantees here. To build it, in the root directory of libpll run
where I know this is a bit rough, but I hope this helps! EDIT: the previous code posted here computed the total parsimony score for the subtree. This new version computes the marginal parsimony score instead.
|
Thanks, Ben,
I’ll build this and check it out!
Very much obliged!
— Will
On Aug 12, 2020, at 1:58 AM, computations <notifications@github.com<mailto:notifications@github.com>> wrote:
Here is a quick little program that was adapted from @xflouris<https://github.com/xflouris> examples. I didn't work to hard on it, so I don't offer any strong guarantees here. To build it, in the root directory of libpll run
cmake -Bbuild -H.
pushd build
make
popd
gcc npr-pars.c -Isrc build/src/libpll.a -lm
where npr-pars.c will be what you choose to name the code below. A few caveats: It only works on rooted binary trees, and it only accepts phylip format files. The argument order is [tree] [alignment].
I know this is a bit rough, but I hope this helps!
Ben
/*
Copyright (C) 2015 Tomas Flouri
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as
published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
Contact: Tomas Flouri <Tomas.Flouri@h-its.org<mailto:Tomas.Flouri@h-its.org>>,
Exelixis Lab, Heidelberg Instutute for Theoretical Studies
Schloss-Wolfsbrunnenweg 35, D-69118 Heidelberg, Germany
*/
#include "pll.h"
#include <search.h>
#include <stdarg.h>
#include <time.h>
#define STATES 4
#define MAP pll_map_nt
static void fatal(const char *format, ...) __attribute__((noreturn));
static void *xmalloc(size_t size) {
void *t;
t = malloc(size);
if (!t)
fatal("Unable to allocate enough memory.");
return t;
}
static char *xstrdup(const char *s) {
size_t len = strlen(s);
char *p = (char *)xmalloc(len + 1);
return strcpy(p, s);
}
/* a callback function for performing a full traversal */
static int cb_full_traversal(pll_rnode_t *node) { return 1; }
static void fatal(const char *format, ...) {
va_list argptr;
va_start(argptr, format);
vfprintf(stderr, format, argptr);
va_end(argptr);
fprintf(stderr, "\n");
exit(EXIT_FAILURE);
}
int main(int argc, char *argv[]) {
unsigned int i, j, n;
unsigned int tip_nodes_count, inner_nodes_count, nodes_count, branch_count;
unsigned int ops_count;
pll_parsimony_t *pars;
pll_pars_buildop_t *operations;
pll_pars_recop_t *recops;
pll_rnode_t **travbuffer;
/* we accept only two arguments - the newick tree (unrooted binary) and the
alignment in the form of PHYLIP reads */
if (argc != 3)
fatal(" syntax: %s [newick] [phylip]", argv[0]);
/* parse the unrooted binary tree in newick format, and store the number
of tip nodes in tip_nodes_count */
pll_rtree_t *tree = pll_rtree_parse_newick(argv[1]);
if (!tree)
fatal("Tree must be a rooted binary tree");
/* compute and show node count information */
tip_nodes_count = tree->tip_count;
inner_nodes_count = tip_nodes_count - 1;
nodes_count = inner_nodes_count + tip_nodes_count;
branch_count = nodes_count - 1;
printf("Number of tip/leaf nodes in tree: %d\n", tip_nodes_count);
printf("Number of inner nodes in tree: %d\n", inner_nodes_count);
printf("Total number of nodes in tree: %d\n", nodes_count);
printf("Number of branches in tree: %d\n", branch_count);
/* Uncomment to display the parsed tree ASCII tree together with information
as to which CLV index, branch length and label is associated with each
node. The code will also write (and print on screen) the newick format
of the tree.
pll_utree_show_ascii(tree->nodes[nodes_count-1],
PLL_UTREE_SHOW_LABEL |
PLL_UTREE_SHOW_BRANCH_LENGTH |
PLL_UTREE_SHOW_CLV_INDEX);
char * newick = pll_utree_export_newick(tree->nodes[nodes_count-1],NULL);
printf("%s\n", newick);
free(newick);
*/
/* create a libc hash table of size tip_nodes_count */
hcreate(tip_nodes_count);
/* populate a libc hash table with tree tip labels */
unsigned int *data =
(unsigned int *)xmalloc(tip_nodes_count * sizeof(unsigned int));
for (i = 0; i < tip_nodes_count; ++i) {
data[i] = tree->nodes[i]->clv_index;
ENTRY entry;
#ifdef __APPLE__
entry.key = xstrdup(tree->nodes[i]->label);
#else
entry.key = tree->nodes[i]->label;
#endif
entry.data = (void *)(data + i);
hsearch(entry, ENTER);
}
/* read PHYLIP alignment */
pll_phylip_t *fd = pll_phylip_open(argv[2], pll_map_phylip);
if (!fd)
fatal(pll_errmsg);
pll_msa_t *msa = pll_phylip_parse_interleaved(fd);
if (!msa)
fatal(pll_errmsg);
pll_phylip_close(fd);
if ((unsigned int)(msa->count) != tip_nodes_count)
fatal("Number of sequences does not match number of leaves in tree");
/* create the PLL parsimony instance
parameters:
tips : number of tip sequences we want to have
states : number of states that our data have
sites : number of sites
score_matrix : score matrix to use
score_buffers: number of score buffers to allocate (typically one per
inner node)
ancestral_buffers: number of ancestral state buffers to allocate
(typically one per inner node)
*/
/* set a matrix where mutations are penalized equally by 1 */
double score_matrix[STATES * STATES];
for (i = 0; i < STATES * STATES; ++i)
score_matrix[i] = 1;
for (i = 0; i < STATES; ++i)
score_matrix[i * STATES + i] = 0;
pars =
pll_parsimony_create(tip_nodes_count, STATES, (unsigned int)(msa->length),
score_matrix, inner_nodes_count, inner_nodes_count);
/* find sequences in hash table and link them with the corresponding taxa */
for (i = 0; i < tip_nodes_count; ++i) {
ENTRY query;
query.key = msa->label[i];
ENTRY *found = NULL;
found = hsearch(query, FIND);
if (!found)
fatal("Sequence with header %s does not appear in the tree",
msa->label[i]);
unsigned int tip_clv_index = *((unsigned int *)(found->data));
//printf("Setting sequence: %s\n", msa->sequence[i]);
pll_set_parsimony_sequence(pars, tip_clv_index, MAP, msa->sequence[i]);
}
pll_msa_destroy(msa);
/* destroy hash table */
hdestroy();
/* we no longer need these two arrays (keys and values of hash table... */
free(data);
/* allocate a buffer for storing pointers to nodes of the tree in postorder
traversal */
travbuffer = (pll_rnode_t **)xmalloc(nodes_count * sizeof(pll_rnode_t *));
operations = (pll_pars_buildop_t *)xmalloc(inner_nodes_count *
sizeof(pll_pars_buildop_t));
/* perform a postorder traversal of the rooted tree */
unsigned int traversal_size;
if (!pll_rtree_traverse(tree->root, PLL_TREE_TRAVERSE_POSTORDER,
cb_full_traversal, travbuffer, &traversal_size))
fatal("Function pll_rtree_traverse() requires inner nodes as parameters");
/* given the computed traversal descriptor, generate the build operations
structure */
pll_rtree_create_pars_buildops(travbuffer, traversal_size, operations,
&ops_count);
//printf("Traversal size: %d\n", traversal_size);
//printf("Operations: %d\n", ops_count);
double minscore = pll_parsimony_build(pars, operations, ops_count);
printf("Minimum parsimony score: %f\n", minscore);
for (i = 0; i < traversal_size; ++i) {
unsigned int id = travbuffer[i]->clv_index;
uint64_t parscore = pll_parsimony_score(pars, id);
travbuffer[i]->length = parscore;
}
char* newick_string = pll_rtree_export_newick(tree->root, 0);
printf("%s\n", newick_string);
/* free traversal, build operations and reconstruct operations */
free(travbuffer);
free(operations);
free(newick_string);
/* we will no longer need the tree structure */
pll_rtree_destroy(tree, NULL);
/* destroy the parsimony structure */
pll_parsimony_destroy(pars);
return (EXIT_SUCCESS);
}
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#18 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AB2HLAYX7RQKU7NUTOF4MLTSAJDTFANCNFSM4P3LDFIA>.
|
Hello again, Ben,
I’m sorry to bother you again, but the output from npr-pars is not what I’d expect. It compiles and runs cleanly, but the computed branch lengths are odd.
With the attached toy example, I’d expect this output (see also attached tenseqs.tree.pdf) — this is from PAUP*:
(seq01:0,(((seq02:1,seq03:1):1,(seq04:0,(seq05:0,seq06:0):1):1):1,(seq07:1,(seq08:0,(seq09:0,seq10:1):1):1):3):1);
but the code produces this (right-of-decimal zeros removed from branch lengths; see also attached tenseqs.npr_pars.tree.pdf):
(seq01:0,(((seq02:0,seq03:0):2,(seq04:0,(seq05:0,seq06:0):0):1):5,(seq07:0,(seq08:0,(seq09:0,seq10:0):1):2):4):13):14;
So it looks like the postorder traversal is adding up changes from tips to root (but skipping the terminal-branch changes). Since I’m simple-minded, I tried changing to a preorder traversal, but it only populated a few of the internal branch lengths.
tenseq.tree and tenseqs.tree.pdf are what I’d hope to get — tenseqs.npr_pars.tree and tenseqs.npr_pars.tree.pdf are what I get.
I realize you are helping out of the goodness of your heart — is there enough goodness left to help a little further?
Thanks again and best regards,
— Will
[cid:9edc767d-bfd2-47f4-befc-9a10011fac6c@win.lanl.gov]
On Aug 12, 2020, at 8:41 AM, Fischer, William Mclean <wfischer@lanl.gov<mailto:wfischer@lanl.gov>> wrote:
Thanks, Ben,
I’ll build this and check it out!
Very much obliged!
— Will
On Aug 12, 2020, at 1:58 AM, computations <notifications@github.com<mailto:notifications@github.com>> wrote:
Here is a quick little program that was adapted from @xflouris<https://github.com/xflouris> examples. I didn't work to hard on it, so I don't offer any strong guarantees here. To build it, in the root directory of libpll run
cmake -Bbuild -H.
pushd build
make
popd
gcc npr-pars.c -Isrc build/src/libpll.a -lm
where npr-pars.c will be what you choose to name the code below. A few caveats: It only works on rooted binary trees, and it only accepts phylip format files. The argument order is [tree] [alignment].
I know this is a bit rough, but I hope this helps!
Ben
/*
Copyright (C) 2015 Tomas Flouri
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as
published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
Contact: Tomas Flouri <Tomas.Flouri@h-its.org<mailto:Tomas.Flouri@h-its.org>>,
Exelixis Lab, Heidelberg Instutute for Theoretical Studies
Schloss-Wolfsbrunnenweg 35, D-69118 Heidelberg, Germany
*/
#include "pll.h"
#include <search.h>
#include <stdarg.h>
#include <time.h>
#define STATES 4
#define MAP pll_map_nt
static void fatal(const char *format, ...) __attribute__((noreturn));
static void *xmalloc(size_t size) {
void *t;
t = malloc(size);
if (!t)
fatal("Unable to allocate enough memory.");
return t;
}
static char *xstrdup(const char *s) {
size_t len = strlen(s);
char *p = (char *)xmalloc(len + 1);
return strcpy(p, s);
}
/* a callback function for performing a full traversal */
static int cb_full_traversal(pll_rnode_t *node) { return 1; }
static void fatal(const char *format, ...) {
va_list argptr;
va_start(argptr, format);
vfprintf(stderr, format, argptr);
va_end(argptr);
fprintf(stderr, "\n");
exit(EXIT_FAILURE);
}
int main(int argc, char *argv[]) {
unsigned int i, j, n;
unsigned int tip_nodes_count, inner_nodes_count, nodes_count, branch_count;
unsigned int ops_count;
pll_parsimony_t *pars;
pll_pars_buildop_t *operations;
pll_pars_recop_t *recops;
pll_rnode_t **travbuffer;
/* we accept only two arguments - the newick tree (unrooted binary) and the
alignment in the form of PHYLIP reads */
if (argc != 3)
fatal(" syntax: %s [newick] [phylip]", argv[0]);
/* parse the unrooted binary tree in newick format, and store the number
of tip nodes in tip_nodes_count */
pll_rtree_t *tree = pll_rtree_parse_newick(argv[1]);
if (!tree)
fatal("Tree must be a rooted binary tree");
/* compute and show node count information */
tip_nodes_count = tree->tip_count;
inner_nodes_count = tip_nodes_count - 1;
nodes_count = inner_nodes_count + tip_nodes_count;
branch_count = nodes_count - 1;
printf("Number of tip/leaf nodes in tree: %d\n", tip_nodes_count);
printf("Number of inner nodes in tree: %d\n", inner_nodes_count);
printf("Total number of nodes in tree: %d\n", nodes_count);
printf("Number of branches in tree: %d\n", branch_count);
/* Uncomment to display the parsed tree ASCII tree together with information
as to which CLV index, branch length and label is associated with each
node. The code will also write (and print on screen) the newick format
of the tree.
pll_utree_show_ascii(tree->nodes[nodes_count-1],
PLL_UTREE_SHOW_LABEL |
PLL_UTREE_SHOW_BRANCH_LENGTH |
PLL_UTREE_SHOW_CLV_INDEX);
char * newick = pll_utree_export_newick(tree->nodes[nodes_count-1],NULL);
printf("%s\n", newick);
free(newick);
*/
/* create a libc hash table of size tip_nodes_count */
hcreate(tip_nodes_count);
/* populate a libc hash table with tree tip labels */
unsigned int *data =
(unsigned int *)xmalloc(tip_nodes_count * sizeof(unsigned int));
for (i = 0; i < tip_nodes_count; ++i) {
data[i] = tree->nodes[i]->clv_index;
ENTRY entry;
#ifdef __APPLE__
entry.key = xstrdup(tree->nodes[i]->label);
#else
entry.key = tree->nodes[i]->label;
#endif
entry.data = (void *)(data + i);
hsearch(entry, ENTER);
}
/* read PHYLIP alignment */
pll_phylip_t *fd = pll_phylip_open(argv[2], pll_map_phylip);
if (!fd)
fatal(pll_errmsg);
pll_msa_t *msa = pll_phylip_parse_interleaved(fd);
if (!msa)
fatal(pll_errmsg);
pll_phylip_close(fd);
if ((unsigned int)(msa->count) != tip_nodes_count)
fatal("Number of sequences does not match number of leaves in tree");
/* create the PLL parsimony instance
parameters:
tips : number of tip sequences we want to have
states : number of states that our data have
sites : number of sites
score_matrix : score matrix to use
score_buffers: number of score buffers to allocate (typically one per
inner node)
ancestral_buffers: number of ancestral state buffers to allocate
(typically one per inner node)
*/
/* set a matrix where mutations are penalized equally by 1 */
double score_matrix[STATES * STATES];
for (i = 0; i < STATES * STATES; ++i)
score_matrix[i] = 1;
for (i = 0; i < STATES; ++i)
score_matrix[i * STATES + i] = 0;
pars =
pll_parsimony_create(tip_nodes_count, STATES, (unsigned int)(msa->length),
score_matrix, inner_nodes_count, inner_nodes_count);
/* find sequences in hash table and link them with the corresponding taxa */
for (i = 0; i < tip_nodes_count; ++i) {
ENTRY query;
query.key = msa->label[i];
ENTRY *found = NULL;
found = hsearch(query, FIND);
if (!found)
fatal("Sequence with header %s does not appear in the tree",
msa->label[i]);
unsigned int tip_clv_index = *((unsigned int *)(found->data));
//printf("Setting sequence: %s\n", msa->sequence[i]);
pll_set_parsimony_sequence(pars, tip_clv_index, MAP, msa->sequence[i]);
}
pll_msa_destroy(msa);
/* destroy hash table */
hdestroy();
/* we no longer need these two arrays (keys and values of hash table... */
free(data);
/* allocate a buffer for storing pointers to nodes of the tree in postorder
traversal */
travbuffer = (pll_rnode_t **)xmalloc(nodes_count * sizeof(pll_rnode_t *));
operations = (pll_pars_buildop_t *)xmalloc(inner_nodes_count *
sizeof(pll_pars_buildop_t));
/* perform a postorder traversal of the rooted tree */
unsigned int traversal_size;
if (!pll_rtree_traverse(tree->root, PLL_TREE_TRAVERSE_POSTORDER,
cb_full_traversal, travbuffer, &traversal_size))
fatal("Function pll_rtree_traverse() requires inner nodes as parameters");
/* given the computed traversal descriptor, generate the build operations
structure */
pll_rtree_create_pars_buildops(travbuffer, traversal_size, operations,
&ops_count);
//printf("Traversal size: %d\n", traversal_size);
//printf("Operations: %d\n", ops_count);
double minscore = pll_parsimony_build(pars, operations, ops_count);
printf("Minimum parsimony score: %f\n", minscore);
for (i = 0; i < traversal_size; ++i) {
unsigned int id = travbuffer[i]->clv_index;
uint64_t parscore = pll_parsimony_score(pars, id);
travbuffer[i]->length = parscore;
}
char* newick_string = pll_rtree_export_newick(tree->root, 0);
printf("%s\n", newick_string);
/* free traversal, build operations and reconstruct operations */
free(travbuffer);
free(operations);
free(newick_string);
/* we will no longer need the tree structure */
pll_rtree_destroy(tree, NULL);
/* destroy the parsimony structure */
pll_parsimony_destroy(pars);
return (EXIT_SUCCESS);
}
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#18 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AB2HLAYX7RQKU7NUTOF4MLTSAJDTFANCNFSM4P3LDFIA>.
|
Hello Will and Ben, Ben, if you have some free time and want to do it, I can write you some more information via email. You will basically need to distinguish between parsimony informative and parsimony uninformative sites. Uninformative sites which are constant contribute no mutations to the lineages. Other uninformative sites contribute one mutation to the terminal branches that correspond to characters (states) that appear only once. You need to sum the number of mutations at each branch, and divide by the number of sites to get the branch lengths. I think it should be an hour or two of work. Otherwise, I can do it next week. I'm busy with some grant writing atm, so I think I will have time on monday/tuesday. |
Hello Tomas,
I think you’re correct in that Ben’s code is summing changes and reporting, for each node, the number of changes in its descendent (child) nodes. This is the algorithm for computing the parsimony score, but not what I need.
I greatly appreciate your or Ben’s "hour or two" of work — especially at grant-writing time.
Very much obliged,
— Will
On Aug 13, 2020, at 3:00 PM, Tomas Flouri <notifications@github.com<mailto:notifications@github.com>> wrote:
Hello Will and Ben,
I haven't checked the code Ben wrote, but I suspect it must derive from the parsimony example which computes the parsimony score of the tree. I think this is not needed since Will has already done the inference, and the branches are what matter here.
Ben, if you have some free time and want to do it, I can write you some more information. You will basically need to distinguish between parsimony informative and parsimony uninformative sites. Uninformative sites which are constant contribute no mutations to the lineages. Other uninformative sites contribute one mutation to the terminal branches that correspond to characters (states) that appear only once. You need to sum the number of mutations at each branch, and divide by the number of sites to get the branch lengths. I think it should be an hour or two of work.
Otherwise, I can do it next. I'm busy with some grant writing atm, so I think I will have time on monday/tuesday.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#18 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AB2HLA4J2BMG3MZPL4C3ARTSARH6XANCNFSM4P3LDFIA>.
|
Hello,
I noticed one of the bugs pretty quickly after posting the code, there is a
fixed version on the GitHub issue. When I did this, I just edited the
GitHub issue, and I guess that it didn't send you the update? The fix stops
the double counting of edits. Unfortunately, I didnt spend any time
assigning the edit to particular sequences.
Regardless, it does seem like what I posted is only roughly what you want.
What I ended up computing is the number of edits that would be required at
a particular node.
At the moment, I am about to go on vacation, so the chance of me getting to
this before Thomas is basically zero, but we will see.
Don't worry about it, sorry I ended up not helping much :)
Best,
…On Fri, Aug 14, 2020, 00:30 willfischer ***@***.***> wrote:
Hello Tomas,
I think you’re correct in that Ben’s code is summing changes and
reporting, for each node, the number of changes in its descendent (child)
nodes. This is the algorithm for computing the parsimony score, but not
what I need.
I greatly appreciate your or Ben’s "hour or two" of work — especially at
grant-writing time.
Very much obliged,
— Will
On Aug 13, 2020, at 3:00 PM, Tomas Flouri ***@***.***
***@***.***>> wrote:
Hello Will and Ben,
I haven't checked the code Ben wrote, but I suspect it must derive from
the parsimony example which computes the parsimony score of the tree. I
think this is not needed since Will has already done the inference, and the
branches are what matter here.
Ben, if you have some free time and want to do it, I can write you some
more information. You will basically need to distinguish between parsimony
informative and parsimony uninformative sites. Uninformative sites which
are constant contribute no mutations to the lineages. Other uninformative
sites contribute one mutation to the terminal branches that correspond to
characters (states) that appear only once. You need to sum the number of
mutations at each branch, and divide by the number of sites to get the
branch lengths. I think it should be an hour or two of work.
Otherwise, I can do it next. I'm busy with some grant writing atm, so I
think I will have time on monday/tuesday.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<
#18 (comment)>,
or unsubscribe<
https://github.com/notifications/unsubscribe-auth/AB2HLA4J2BMG3MZPL4C3ARTSARH6XANCNFSM4P3LDFIA>.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#18 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AC45GSB64HOFDIRQQZO6W2LSARSQPANCNFSM4P3LDFIA>
.
|
No worries, thanks for your efforts — and have a nice vacation!
— Will
On Aug 13, 2020, at 5:22 PM, computations <notifications@github.com<mailto:notifications@github.com>> wrote:
Hello,
I noticed one of the bugs pretty quickly after posting the code, there is a
fixed version on the GitHub issue. When I did this, I just edited the
GitHub issue, and I guess that it didn't send you the update? The fix stops
the double counting of edits. Unfortunately, I didnt spend any time
assigning the edit to particular sequences.
Regardless, it does seem like what I posted is only roughly what you want.
What I ended up computing is the number of edits that would be required at
a particular node.
At the moment, I am about to go on vacation, so the chance of me getting to
this before Thomas is basically zero, but we will see.
Don't worry about it, sorry I ended up not helping much :)
Best,
On Fri, Aug 14, 2020, 00:30 willfischer ***@***.******@***.***>> wrote:
Hello Tomas,
I think you’re correct in that Ben’s code is summing changes and
reporting, for each node, the number of changes in its descendent (child)
nodes. This is the algorithm for computing the parsimony score, but not
what I need.
I greatly appreciate your or Ben’s "hour or two" of work — especially at
grant-writing time.
Very much obliged,
— Will
On Aug 13, 2020, at 3:00 PM, Tomas Flouri ***@***.******@***.***>
***@***.***>> wrote:
Hello Will and Ben,
I haven't checked the code Ben wrote, but I suspect it must derive from
the parsimony example which computes the parsimony score of the tree. I
think this is not needed since Will has already done the inference, and the
branches are what matter here.
Ben, if you have some free time and want to do it, I can write you some
more information. You will basically need to distinguish between parsimony
informative and parsimony uninformative sites. Uninformative sites which
are constant contribute no mutations to the lineages. Other uninformative
sites contribute one mutation to the terminal branches that correspond to
characters (states) that appear only once. You need to sum the number of
mutations at each branch, and divide by the number of sites to get the
branch lengths. I think it should be an hour or two of work.
Otherwise, I can do it next. I'm busy with some grant writing atm, so I
think I will have time on monday/tuesday.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<
#18 (comment)>,
or unsubscribe<
https://github.com/notifications/unsubscribe-auth/AB2HLA4J2BMG3MZPL4C3ARTSARH6XANCNFSM4P3LDFIA>.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#18 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AC45GSB64HOFDIRQQZO6W2LSARSQPANCNFSM4P3LDFIA>
.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#18 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AB2HLA3UO5OGVSBON4DOVRLSARYSLANCNFSM4P3LDFIA>.
|
Hello Will, I have added a new function called Branch lengths are then normalized by the number of sites in the main loop of the program. To measure how long a run will take, you can run the program for a small portion of the alignment and extrapolate for the full alignment. Let me know if you have any questions, perhaps we can move the discussion to email: t.flouris@ucl.ac.uk , although I will be quite busy for the next two weeks. Tomas |
Tomas,
Thanks very much for this! Two weeks is a good span of time for me to play with this — so I will leave you alone for the moment (smile) and check back with you once your immediate crisis is past.
Best regards,
— Will
On Aug 17, 2020, at 2:49 PM, Tomas Flouri <notifications@github.com<mailto:notifications@github.com>> wrote:
Hello Will,
attached below is a modification of the parsimony example from here: https://github.com/xflouris/libpll/tree/master/examples/parsimony
I have added a new function called comp_branch_lenghts() that computes the branch lengths (number of mutations on a branch). Please note that optimal branch lengths are not unique as there may exist many sets of branch lengths that yield the same cost, particularly when the tree tends to be symmetrical. Consider for example the tree ((A,B),(C,D)); and a single site AACC. Either of the internal branch lengths may have length 1 and the other 0.
Branch lengths are then normalized by the number of sites in the main loop of the program. To measure how long a run will take, you can run the program for a small portion of the alignment and extrapolate for the full alignment.
Let me know if you have any questions, perhaps we can move the discussion to email: t.flouris@ucl.ac.uk<mailto:t.flouris@ucl.ac.uk> , although I will be quite busy for the next two weeks.
Tomas
pars-bl.c.gz<https://github.com/xflouris/libpll-2/files/5086674/pars-bl.c.gz>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#18 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AB2HLA32QGMNZIW3EUIOXLTSBGJVHANCNFSM4P3LDFIA>.
|
Is it possible to add parsimony branch lengths to a topology-only tree?
This would be a useful example.
e.g. for the following tree:
(seq1,(seq2,(seq4,(seq3,(seq5,seq6)))));
and sequences:
The text was updated successfully, but these errors were encountered: