Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Many more reads in SNP mapping for 1.3.0 vs 1.2.1 #66

Closed
surh opened this issue Aug 17, 2017 · 4 comments
Closed

Many more reads in SNP mapping for 1.3.0 vs 1.2.1 #66

surh opened this issue Aug 17, 2017 · 4 comments

Comments

@surh
Copy link

surh commented Aug 17, 2017

I recently switched from version 1.2.1 to version 1.3.0. I realize there are newer versions of the alignment software in this release so I wasn't surprised when I saw some tiny differences in species identifications. Here is an example of abundances from MIDAS/1.2.1

$ head /scratch/users/surh/midas/hmp/SRS051941/species/species_profile.txt 
species_id	count_reads	coverage	relative_abundance
Corynebacterium_matruchotii_57066	20696	188.980369951	0.291044573929
Actinomyces_sp_62446	3293	28.2154305895	0.0434539734274
Selenomonas_noxia_56778	2150	20.6481371969	0.0317997488018
Leptotrichia_buccalis_61702	2092	20.2566223535	0.0311967852728
Capnocytophaga_granulosa_57613	1895	18.7777777778	0.0289192487775
Haemophilus_parainfluenzae_62468	1888	17.649813272	0.027182095077
Capnocytophaga_sp_58386	1697	15.5149051491	0.0238941693247
Rothia_dentocariosa_57938	1641	13.9604698045	0.021500217124

And here are the corresponding results from mapping with MIDAS/1.3.0

$ head species_profile.txt 
species_id	count_reads	coverage	relative_abundance
Corynebacterium_matruchotii_57066	20696	188.980369951	0.291047096058
Actinomyces_sp_62446	3293	28.2154305895	0.0434543499901
Selenomonas_noxia_56778	2150	20.6481371969	0.0318000243715
Leptotrichia_buccalis_61702	2093	20.2645987198	0.0312093399527
Capnocytophaga_granulosa_57613	1895	18.7777777778	0.0289194993854
Haemophilus_parainfluenzae_62468	1886	17.63975869	0.0271668456529
Capnocytophaga_sp_58386	1712	15.6553593122	0.0241106886749
Rothia_dentocariosa_57938	1642	13.9689718643	0.0215134973917
Capnocytophaga_ochracea_58179	1318	11.9352005916	0.018381303169

You can see that there are some tiny differences that probably don't influence the results.

However, I then tried to obtain SNPs from a species, with the following command:

run_midas.py snps /scratch/users/surh/micropopgen/exp/2017/today3/midas121/SRS051941 -1 /scratch/users/surh/hmp_samples//SRS051941_read1.fastq.bz2 -2 /scratch/users/surh/hmp_samples//SRS051941_read2.fastq.bz2 -t 8 --remove_temp --species_cov 3.0 --mapid 94.0 --mapq 20 --baseq 30 --readq 30 --species_id  Haemophilus_parainfluenzae_62356

I used the exact same command (except for output directory) with both MIDAS/1.2.1 and MIDAS/1.3.0

This is what I get from MIDAS/1.2.1:

$ zcat midas121/SRS051941/snps/output/Haemophilus_parainfluenzae_62356.snps.gz | head
ref_id	ref_pos	ref_allele	alt_allele	ref_freq	depth	count_atcg
FQ312002	1	T	NA	0.0	0	0,0,0,0
FQ312002	2	A	NA	1.0	1	1,0,0,0
FQ312002	3	T	NA	1.0	2	0,2,0,0
FQ312002	4	G	NA	1.0	2	0,0,0,2
FQ312002	5	G	NA	1.0	4	0,0,0,4
FQ312002	6	C	NA	1.0	6	0,0,6,0
FQ312002	7	T	A	0.833333333333	6	1,5,0,0
FQ312002	8	A	NA	1.0	7	7,0,0,0
FQ312002	9	T	NA	1.0	7	0,7,0,0

And this is what I get from MIDAS/1.3.0

$ zcat midas130/SRS051941/snps/output/Haemophilus_parainfluenzae_62356.snps.gz | head
ref_id	ref_pos	ref_allele	depth	count_a	count_c	count_g	count_t
FQ312002	1	T	15	0	0	0	15
FQ312002	2	A	16	16	0	0	0
FQ312002	3	T	17	0	0	0	17
FQ312002	4	G	18	0	0	18	0
FQ312002	5	G	17	0	0	17	0
FQ312002	6	C	20	0	20	0	0
FQ312002	7	T	20	0	0	0	20
FQ312002	8	A	23	23	0	0	0
FQ312002	9	T	21	0	0	0	21

You can see I get dramatically more reads in MIDAS/1.3.0 though it seems like the major allele is consistent.

I just wonder if such big differences are expected. I attach a couple of small read files that seem to reproduce the pattern, which I didn't' see with the test sample in the test directory.

Thanks,
Sur
read1.fastq.gz
read2.fastq.gz

@snayfach
Copy link
Owner

Thank you for reporting this.

One thing I forgot to add in the release notes is that I switched from global to local Bowtie2 alignment in the SNPs module. I made this change because the minimum read alignment coverage is specified by the --aln_cov option. The default value is 0.75, meaning that the read must be covered by 75% by the alignment to the representative genome.

For consistency, I will add an option to allow the user to specify local or global alignment and make a new release in the next day or two. I should be able to confirm that this change results in output that is consistent with v1.2.1.

Stephen

@surh
Copy link
Author

surh commented Aug 17, 2017

Thanks for clarifying. Do you then recommend using the local alignment with 75% query coverage? I can see that that will produce more hits, but how much specificity is sacrificed?

Thanks,
Sur

@snayfach
Copy link
Owner

Yes, that is what I would recommend. But feel free to experiment with increasing the alignment coverage and seeing if it changes your results. In the future, this is something that could be more rigorously evaluated along with other parameters in the SNP calling pipeline.

Stephen

@snayfach
Copy link
Owner

Hi Sur,

I've addressed this issue with a new release. Please see the release notes for details. I'm closing this issue, but feel free to reopen it in the future if you notice other discrepancies.

Thanks,
Stephen

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants