Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compute parsimony branch lengths #18

Open
willfischer opened this issue Aug 11, 2020 · 16 comments
Open

compute parsimony branch lengths #18

willfischer opened this issue Aug 11, 2020 · 16 comments

Comments

@willfischer
Copy link

Is it possible to add parsimony branch lengths to a topology-only tree?

This would be a useful example.

e.g. for the following tree:
(seq1,(seq2,(seq4,(seq3,(seq5,seq6)))));

and sequences:

seq1
TTTGACAGCAGCCTAGCATTTCGTCACGTGGCCCGAGAGCTGCATCCGGA
seq2
TTcGACAGCAGCCTAGCATTTCGTCACGTGGCCCGAGAGCTGCATCCGGA
seq3
TTcGACAGCAGCCTAGCAaTTCGTCACGTGGCCCGAGAGCTGCATCCGGA
seq4
TTcGACAGCAGCCTAGCAaTTCGTCACGTGGCCCGAGAGCTGCATCCGGA
seq5
TTcGACAGCAGCCTAGCAaTTCGTCACGTGGCtCGAGAGCTGCATCCGGA
seq6
TTTGACAGCAGCCTAGCAaTTCGTCACGTGGCtCGAGAGCTGCATCCGGA

@computations
Copy link
Collaborator

Hello Will,

I am a little unclear about the context of your question. LibPLL is a low level library, so optimizations like computing branch lengths are done at a higher level. As such this is a bit outside the scope of LibPLL, but fortunately this functionality exists in raxml-ng, if you have the tree and alignment as files. Just pass the tree in using the --tree switch, and add the --evaluate argument as well.

If you have any more questions, please ask.

@willfischer
Copy link
Author

willfischer commented Aug 11, 2020 via email

@willfischer
Copy link
Author

willfischer commented Aug 11, 2020 via email

@xflouris
Copy link
Owner

Hello @willfischer ,

I'm not sure what you ask is implemented in raxml-ng. As Ben pointed out, the --evaluate can be used in raxml-ng to evaluate the tree, but I think it's only available for ML.

In general, we have functions in libpll that compute the parsimony score of a tree and such functions are probably used in raxml-ng for quickly creating starting trees using methods like randomized stepwise addition.

Parsimony functions implemented in libpll follow the minimum evolution principle as an approximation to ML, and what the functions give is the minimum number of mutations required for the evolutionary process to give the sequences. If I understand your issue correctly, replacing the parsimony informative columns with all columns in the original alignment won't change the score.
I suspect what you are interested in is some kind of expected number of mutations per site as branch lengths, which you probably already have for the reduced alignment, computed as y=branch_mutations / sites. Adding x constant sites makes sense and you can easily obtain the new branches as y'=y * sites/x since no mutations are introduced. For other uninformative sites e.g. aaaabcd choose the character appearing most times ('a' in the example) and add one mutation to each terminal branch corresponding to the other characters (b c and d). Finally, compute the expected mutations in a similar way e.g. y'=(y*sites + new_mutations)/x where x is the total number of sites.
Does it make sense what I'm writing?

Best,
Tomas

@computations
Copy link
Collaborator

computations commented Aug 11, 2020

Hello Will,

Sorry, I am Ben. I need to be really clear here that raxml-ng will compute branch lengths under a specified substitution model, and so cannot really be considered "parsimony branch lengths". To be honest, I am not entirely sure what you mean by this. I personally have never heard of researchers computing branch lengths under a parsimony assumption, and most of the literature that I can find on this topic suggests that distances are computed under some Markov substitution process, such as JC. If you want the minimum number of substitutions required the JC model should suit your needs for short distances.

At this point, I feel obligated to inform you that there is a good chance that even on the SARS-CoV-2 data of underestimated branch lengths when using parsimony, even though the distances between most of the sequences is small. Furthermore, in our own investigations on the SARS-CoV-2 data that the topologies produced by phylogenetic methods are highly suspect. In particular, there are likely to be many trees which either maximize the likelihood. I suspect that in this case, since the distances between the sequences are so small, the most parsimonious tree is similar to the maximum likelihood tree. This is to say, the parsimony surface is likely to exhibit many maxima, and it will be difficult to distinguish between different topologies from the signal present in the sequences alone.

Now that I have said that, if you really want the minimum number of substitutions required for a given tree, there is an existing bit of code that I can adapt to do this if you need it. (This was written before @xflouris wrote his example, but the offer still stand if you need it).

Best,
Ben

@willfischer
Copy link
Author

willfischer commented Aug 11, 2020 via email

@willfischer
Copy link
Author

willfischer commented Aug 11, 2020 via email

@computations
Copy link
Collaborator

computations commented Aug 12, 2020

In regards to the choice of parsimony, I think it is a fine choice for this application (I did some snooping and found what I strongly suspect is your preprint for this paper), but I just wanted to be sure you were aware of the problems. I wish you luck in your endeavors!

Here is a quick little program that was adapted from @xflouris examples. I didn't work to hard on it, so I don't offer any strong guarantees here. To build it, in the root directory of libpll run

cmake -Bbuild -H.
pushd build
make
popd
gcc npr-pars.c -Isrc build/src/libpll.a -lm

where npr-pars.c will be what you choose to name the code below. A few caveats: It only works on rooted binary trees, and it only accepts phylip format files. It also computes the parsimony score associated with a node and not a branch. The argument order is [tree] [alignment]. Also, this is uncorrected by the number of sites. If you want to correct it on line 215 change parscore to (double)parscore/(double)pars->sites

I know this is a bit rough, but I hope this helps!
Ben

EDIT: the previous code posted here computed the total parsimony score for the subtree. This new version computes the marginal parsimony score instead.

/*
    Copyright (C) 2015 Tomas Flouri

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU Affero General Public License as
    published by the Free Software Foundation, either version 3 of the
    License, or (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU Affero General Public License for more details.

    You should have received a copy of the GNU Affero General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.

    Contact: Tomas Flouri <Tomas.Flouri@h-its.org>,
    Exelixis Lab, Heidelberg Instutute for Theoretical Studies
    Schloss-Wolfsbrunnenweg 35, D-69118 Heidelberg, Germany
*/

#include "pll.h"
#include <search.h>
#include <stdarg.h>
#include <time.h>

#define STATES 4
#define MAP pll_map_nt

static void fatal(const char *format, ...) __attribute__((noreturn));

static void *xmalloc(size_t size) {
  void *t;
  t = malloc(size);
  if (!t)
    fatal("Unable to allocate enough memory.");

  return t;
}

static char *xstrdup(const char *s) {
  size_t len = strlen(s);
  char *p = (char *)xmalloc(len + 1);
  return strcpy(p, s);
}

/* a callback function for performing a full traversal */
static int cb_full_traversal(pll_rnode_t *node) { return 1; }

static void fatal(const char *format, ...) {
  va_list argptr;
  va_start(argptr, format);
  vfprintf(stderr, format, argptr);
  va_end(argptr);
  fprintf(stderr, "\n");
  exit(EXIT_FAILURE);
}

int main(int argc, char *argv[]) {
  unsigned int i, j, n;
  unsigned int tip_nodes_count, inner_nodes_count, nodes_count, branch_count;
  unsigned int ops_count;
  pll_parsimony_t *pars;
  pll_pars_buildop_t *operations;
  pll_pars_recop_t *recops;
  pll_rnode_t **travbuffer;

  /* we accept only two arguments - the newick tree (unrooted binary) and the
     alignment in the form of PHYLIP reads */
  if (argc != 3)
    fatal(" syntax: %s [newick] [phylip]", argv[0]);

  /* parse the unrooted binary tree in newick format, and store the number
     of tip nodes in tip_nodes_count */
  pll_rtree_t *tree = pll_rtree_parse_newick(argv[1]);
  if (!tree)
    fatal("Tree must be a rooted binary tree");

  /* compute and show node count information */
  tip_nodes_count = tree->tip_count;
  inner_nodes_count = tip_nodes_count - 1;
  nodes_count = inner_nodes_count + tip_nodes_count;
  branch_count = nodes_count - 1;

  printf("Number of tip/leaf nodes in tree: %d\n", tip_nodes_count);
  printf("Number of inner nodes in tree: %d\n", inner_nodes_count);
  printf("Total number of nodes in tree: %d\n", nodes_count);
  printf("Number of branches in tree: %d\n", branch_count);

  /* Uncomment to display the parsed tree ASCII tree together with information
     as to which CLV index, branch length and label is associated with each
     node. The code will also write (and print on screen) the newick format
     of the tree.

  pll_utree_show_ascii(tree->nodes[nodes_count-1],
                       PLL_UTREE_SHOW_LABEL |
                       PLL_UTREE_SHOW_BRANCH_LENGTH |
                       PLL_UTREE_SHOW_CLV_INDEX);
  char * newick = pll_utree_export_newick(tree->nodes[nodes_count-1],NULL);
  printf("%s\n", newick);
  free(newick);

  */

  /* create a libc hash table of size tip_nodes_count */
  hcreate(tip_nodes_count);

  /* populate a libc hash table with tree tip labels */
  unsigned int *data =
      (unsigned int *)xmalloc(tip_nodes_count * sizeof(unsigned int));
  for (i = 0; i < tip_nodes_count; ++i) {
    data[i] = tree->nodes[i]->clv_index;
    ENTRY entry;
#ifdef __APPLE__
    entry.key = xstrdup(tree->nodes[i]->label);
#else
    entry.key = tree->nodes[i]->label;
#endif
    entry.data = (void *)(data + i);
    hsearch(entry, ENTER);
  }

  /* read PHYLIP alignment */
  pll_phylip_t *fd = pll_phylip_open(argv[2], pll_map_phylip);
  if (!fd)
    fatal(pll_errmsg);

  pll_msa_t *msa = pll_phylip_parse_interleaved(fd);
  if (!msa)
    fatal(pll_errmsg);

  pll_phylip_close(fd);

  if ((unsigned int)(msa->count) != tip_nodes_count)
    fatal("Number of sequences does not match number of leaves in tree");

  /* create the PLL parsimony instance

     parameters:

     tips : number of tip sequences we want to have
     states : number of states that our data have
     sites : number of sites
     score_matrix : score matrix to use
     score_buffers: number of score buffers to allocate (typically one per
                    inner node)
     ancestral_buffers: number of ancestral state buffers to allocate
                        (typically one per inner node)
  */

  /* set a matrix where mutations are penalized equally by 1 */
  double score_matrix[STATES * STATES];
  for (i = 0; i < STATES * STATES; ++i)
    score_matrix[i] = 1;
  for (i = 0; i < STATES; ++i)
    score_matrix[i * STATES + i] = 0;

  pars =
      pll_parsimony_create(tip_nodes_count, STATES, (unsigned int)(msa->length),
                           score_matrix, inner_nodes_count, inner_nodes_count);

  /* find sequences in hash table and link them with the corresponding taxa */
  for (i = 0; i < tip_nodes_count; ++i) {
    ENTRY query;
    query.key = msa->label[i];
    ENTRY *found = NULL;

    found = hsearch(query, FIND);

    if (!found)
      fatal("Sequence with header %s does not appear in the tree",
            msa->label[i]);

    unsigned int tip_clv_index = *((unsigned int *)(found->data));

    // printf("Setting sequence: %s\n", msa->sequence[i]);
    pll_set_parsimony_sequence(pars, tip_clv_index, MAP, msa->sequence[i]);
  }

  pll_msa_destroy(msa);

  /* destroy hash table */
  hdestroy();

  /* we no longer need these two arrays (keys and values of hash table... */
  free(data);

  /* allocate a buffer for storing pointers to nodes of the tree in postorder
     traversal */
  travbuffer = (pll_rnode_t **)xmalloc(nodes_count * sizeof(pll_rnode_t *));

  operations = (pll_pars_buildop_t *)xmalloc(inner_nodes_count *
                                             sizeof(pll_pars_buildop_t));

  /* perform a postorder traversal of the rooted tree */
  unsigned int traversal_size;
  if (!pll_rtree_traverse(tree->root, PLL_TREE_TRAVERSE_POSTORDER,
                          cb_full_traversal, travbuffer, &traversal_size))
    fatal("Function pll_rtree_traverse() requires inner nodes as parameters");

  /* given the computed traversal descriptor, generate the build operations
     structure */
  pll_rtree_create_pars_buildops(travbuffer, traversal_size, operations,
                                 &ops_count);

  // printf("Traversal size: %d\n", traversal_size);
  // printf("Operations: %d\n", ops_count);

  double minscore = pll_parsimony_build(pars, operations, ops_count);
  printf("Minimum parsimony score: %f\n", minscore);

  for (i = 0; i < traversal_size; ++i) {
    unsigned int id = travbuffer[i]->clv_index;
    uint64_t parscore = pll_parsimony_score(pars, id);
    travbuffer[i]->length = parscore;
  }

  // correct the scores so that they are marginal

  for (i = 0; i < traversal_size; ++i) {
    pll_rnode_t *cur = travbuffer[i];
    if (cur->left)
      cur->length -= cur->left->length;
    if (cur->right)
      cur->length -= cur->right->length;
  }

  char *newick_string = pll_rtree_export_newick(tree->root, 0);
  printf("%s\n", newick_string);

  /* free traversal, build operations and reconstruct operations */
  free(travbuffer);
  free(operations);
  free(newick_string);

  /* we will no longer need the tree structure */
  pll_rtree_destroy(tree, NULL);

  /* destroy the parsimony structure */
  pll_parsimony_destroy(pars);

  return (EXIT_SUCCESS);
}

@willfischer
Copy link
Author

willfischer commented Aug 12, 2020 via email

@willfischer
Copy link
Author

willfischer commented Aug 13, 2020 via email

@xflouris
Copy link
Owner

xflouris commented Aug 13, 2020

Hello Will and Ben,
I haven't checked the code Ben wrote, but I suspect it must derive from the parsimony example which computes the parsimony score of the tree. I think this is not needed since Will has already done the inference, and the branches are what matter here.

Ben, if you have some free time and want to do it, I can write you some more information via email. You will basically need to distinguish between parsimony informative and parsimony uninformative sites. Uninformative sites which are constant contribute no mutations to the lineages. Other uninformative sites contribute one mutation to the terminal branches that correspond to characters (states) that appear only once. You need to sum the number of mutations at each branch, and divide by the number of sites to get the branch lengths. I think it should be an hour or two of work.

Otherwise, I can do it next week. I'm busy with some grant writing atm, so I think I will have time on monday/tuesday.

@willfischer
Copy link
Author

willfischer commented Aug 13, 2020 via email

@computations
Copy link
Collaborator

computations commented Aug 13, 2020 via email

@willfischer
Copy link
Author

willfischer commented Aug 13, 2020 via email

@xflouris
Copy link
Owner

Hello Will,
attached below is a modification of the parsimony example from here: https://github.com/xflouris/libpll/tree/master/examples/parsimony

I have added a new function called comp_branch_lenghts() that computes the branch lengths (number of mutations on a branch). Please note that optimal branch lengths are not unique as there may exist many sets of branch lengths that yield the same cost, particularly when the tree tends to be symmetrical. Consider for example the tree ((A,B),(C,D)); and a single site AACC. Either of the internal branch lengths may have length 1 and the other 0.

Branch lengths are then normalized by the number of sites in the main loop of the program. To measure how long a run will take, you can run the program for a small portion of the alignment and extrapolate for the full alignment.

Let me know if you have any questions, perhaps we can move the discussion to email: t.flouris@ucl.ac.uk , although I will be quite busy for the next two weeks.

Tomas

pars-bl.c.gz

@willfischer
Copy link
Author

willfischer commented Aug 17, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants