Issue with strain phylogeny #53

kylecampbell3 · 2017-05-03T22:43:23Z

Hi Stephen,

Thank you for your program!

I am trying to use MIDAS to construct a detailed strain phylogeny for a single species, and am running into a bit of trouble in the final steps. I created a custom database using a pan-genome reference that had previously been assembled and ran run_midas.py species and run_midas.py snps individually for three metagenomic samples (each around 2GB). I merged the snp results for the three samples and then ran call_consensus.py. When I used the consensus.fa file output to construct a phylogenetic tree on FastTree, it produced a tree with only three leaves, treating each metagenomic output as a strain, rather than identifying many different strains within each metagenome. Is this the way that the program is supposed to work, or could settings be adjusted to produce a more detailed phylogeny?

Thanks for your help,

Kyle Campbell

snayfach · 2017-05-03T22:51:12Z

Hi Kyle, Unfortunately at the current time, MIDAS only produces strain phylogenies based on consensus sequences. As you observe, that means that you will get only 1 leaf per metagenome. These phylogenies will be accurate only if the intra-metagenome diversity is low. I would suggest performing some followup analysis to assess the intra-metagenome and inter-metagenome diversity levels. The script `snp_diversity.py` should help in that regard. If the intra-diversity is very low, you have clonal populations and the phylogenetic tree based on consensus sequences is likely accurate. Please let me know if you have any questions. In the future I may add code to estimate strain phylogenies that take into account multiple strains/species/sample. Best, Stephen

…

On Wed, May 3, 2017 at 3:43 PM, kylecampbell3 ***@***.***> wrote: Hi Stephen, Thank you for your program! I am trying to use MIDAS to construct a detailed strain phylogeny for a single species, and am running into a bit of trouble in the final steps. I created a custom database using a pan-genome reference that had previously been assembled and ran run_midas.py species and run_midas.py snps individually for three metagenomic samples (each around 2GB). I merged the snp results for the three samples and then ran call_consensus.py. When I used the consensus.fa file output to construct a phylogenetic tree on FastTree, it produced a tree with only three leaves, treating each metagenomic output as a strain, rather than identifying many different strains within each metagenome. Is this the way that the program is supposed to work, or could settings be adjusted to produce a more detailed phylogeny? Thanks for your help, Kyle Campbell — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#53>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ACAbrWWnWlQsYw_esx3imKT-M-HRCsOmks5r2QMLgaJpZM4NQCtz> .

marianne-LotterJones · 2017-05-03T23:39:16Z

Hi Stephen,
I am working with Kyle on this project. Thank you for your suggestion! However, when we enter the code, we keep coming across this error:

ubuntu@ip-172-31-61-218:~/MIDAS$ python /home/ubuntu/MIDAS/scripts/snp_diversity.py --indir /home/ubuntu/MIDAS/psb1_snpmerge_out/psb1/ --out /home/ubuntu/MIDAS/snp_diversity

MIDAS: Metagenomic Intra-species Diversity Analysis System
version 1.2.1; github.com/snayfach/MIDAS
Copyright (C) 2015-2016 Stephen Nayfach
Freely distributed under the GNU General Public License (GPLv3)

===============================
===========Parameters===========
Command: /home/ubuntu/MIDAS/scripts/snp_diversity.py --indir /home/ubuntu/MIDAS/psb1_snpmerge_out/psb1/ --out /home/ubuntu/MIDAS/snp_diversity
Script: snp_diversity.py
Input directory: /home/ubuntu/MIDAS/psb1_snpmerge_out/psb1/
Output file: /home/ubuntu/MIDAS/snp_diversity
Diversity options:
genomic_type: genome-wide
sample_type: per-sample
site_type: ['NC', '1D', '2D', '3D', '4D']
weight_by_depth: False
rand_reads: None
replace_reads: False
rand_samples: None
rand_sites: None
Sample filters:
sample_depth: 0.0
fract_cov: 0.0
max_samples: inf
keep_samples: None
exclude_samples: None
Site filters:
site_depth: 2
site_prev: 0.0
site_maf: 0.0
site_ratio: inf
site_freq: 0.0
max_sites: inf

Selecting subset of samples...
Estimating diversity metrics...
Traceback (most recent call last):
File "/home/ubuntu/MIDAS/scripts/snp_diversity.py", line 339, in
pi = compute_pi(args, samples)
File "/home/ubuntu/MIDAS/scripts/snp_diversity.py", line 215, in compute_pi
for site in diversity.parse_sites(args['indir'], samples): # loop over sites
File "/home/ubuntu/MIDAS/midas/analyze/diversity.py", line 144, in parse_sites
site = GenomicSite(files, samples, info)
File "/home/ubuntu/MIDAS/midas/analyze/diversity.py", line 19, in init
self.ref_id, self.ref_pos, self.ref_allele = self.parse_id()
File "/home/ubuntu/MIDAS/midas/analyze/diversity.py", line 34, in parse_id
ref_pos = int(self.id.split('|')[1])
ValueError: invalid literal for int() with base 10: 'quiver'

Having not had much coding experience, we do not know what this means or how to fix it. Do you have any suggestions?

snayfach · 2017-05-03T23:41:58Z

Thanks for reporting the error. I will look into it and get back to you.

marianne-LotterJones · 2017-05-04T14:59:38Z

We think the ValueError: invalid literal for int() with base 10: 'quiver' is based on our input system where the program is expecting an integer and instead there's string. We removed 'quiver' from our input file and the program ran.

snayfach · 2017-05-14T15:51:16Z

Hi Kyle - I'm posting your questions from email here in case it is useful to others. You wrote:

As you suggested, I ran the followup analysis to assess the intra-metagenome and inter-metagenome diversity of three metagenomic samples using the 'snp_diversity.py' script.   
It gave me the following output:   

sample_id     depth               count_sites     diversity
psb1out_3    1494503086     7159108         19335.1858363
lizzy2out       144701011       7159108         16136.8075072
lizzy9out       247182331       7159108         27027.809093

Could you explain the meaning of the diversity statistic?  Is this the number of unique strains found in each sample, or is there some other method you used to calculate this?

First off, it looks like these are intra-sample diversity estimates. Could you run the script for inter-sample diversity also?

The diversity statistic is the sum of the number of nucleotide substitutions over the total number of sites analyzed. So if you divide by count_sites, you will get the number of substitutions per site. You can read more about this statistic here: https://en.wikipedia.org/wiki/Nucleotide_diversity

After dividing by the number of sites, per-bp diversity ranges from 2.2e-3 to 3.8e-3. In my experience this indicates a relatively low-diversity community that is probably composed of one dominant strain. If you look at figure 4 from my biorxiv paper you can compare your results to other human gut bacteria: http://biorxiv.org/content/early/2015/11/14/031757

I would recommend using the same script to compute between-sample (i.e. inter-sample) diversity. This script will pool the data across your 3 samples (assuming they are the same species) and use this pooled data to compute diversity again. If this number is much higher than the per-sample diversity, then you can conclude that the nucleotide differences between populations are much greater than those within the individual populations. Does this make sense?

snayfach closed this as completed Feb 12, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue with strain phylogeny #53

Issue with strain phylogeny #53

kylecampbell3 commented May 3, 2017

snayfach commented May 3, 2017 via email

marianne-LotterJones commented May 3, 2017

snayfach commented May 3, 2017

marianne-LotterJones commented May 4, 2017

snayfach commented May 14, 2017

Issue with strain phylogeny #53

Issue with strain phylogeny #53

Comments

kylecampbell3 commented May 3, 2017

snayfach commented May 3, 2017 via email

marianne-LotterJones commented May 3, 2017

snayfach commented May 3, 2017

marianne-LotterJones commented May 4, 2017

snayfach commented May 14, 2017