Scripts used to reproduce the analyses of the manuscript:
Wang, Y., Robinson, P.S., Coorens, T.H.H. et al. APOBEC mutagenesis is a common process in normal human small intestine. Nat Genet 55, 246–254 (2023). https://doi.org/10.1038/s41588-022-01296-5
Please contact Yichen (yw2@sanger.ac.uk) if you have any questions or enquiries.
Raw sequencing data can be accessed from the European Genome-phenome Archive (EGA) with the accession code EGAD00001008764.
Input data for all figures can be found under the /data directory.
data/somatic_mutations/snp/ contains the final SNPs placed on phylogenetic branches.
data/somatic_mutations/indel/ contains the final INDELs placed on phylogenetic branches.
data/mutation_matrices/ contains the final SBS and ID matrices for the cohort.
data/signatures/ contains the original HDP signatures, their decomposition into PCAWG reference signatures, and the final reference signature exposures for each sample and each phylogenetic branch.
data/phylogenetic_trees/ contains the phylogenetic trees generated by MPBoot using single-base substitutions and indels. The length of the branch represents the number of mutations on the branch.
data/vcf/ contains the final SNP vcf files for finding and plotting kataegis.
data/cancer/ contains the paired cancer data for comparison (mutational burden and mutational signature exposures).
data/motif/ contains enrichment scores for TCN motifs.
The final mutation files can be found in extended data tables and data/somatic_mutations/.
Alternatively, they can be generated from the raw data via the Sanger pipeline (https://github.com/cancerit) using CaVEMan, Pindel, ASCAT and BRASS. When a matched normal sample is available, run all algorithms with that sample as the matched normal. Otherwise, run them unmatched against the synthetic BAM PDv37is.
The filters applied to SNVs and indels to exclude LCM artefacts can be found at https://github.com/MathijsSanders/SangerLCMFiltering, and the beta-binomial filter to exclude germline mutations is at https://github.com/TimCoorens/Unmatched_NormSeq.
Phylogenetic trees were constructed using MPBoot, with supplementary code in Phylogeny/.
Phylogeny/filtering.R contains the beta-binomial filter for the previous step and will generate the input *for_MPBoot.fa file for MPBoot. Then run:
mpboot -s $patient/${opt}_for_MPBoot.fa -bb 1000
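When several patients need to be processed, the command above can be wrapped in a small helper. The following is only a sketch under stated assumptions: the function name `run_mpboot`, the `DRY_RUN` switch, and the example values `patientA` and `snp` are hypothetical and not part of the repository; only the `mpboot -s ... -bb 1000` invocation and the `*_for_MPBoot.fa` naming come from the text above.

```shell
# Sketch only: run_mpboot and DRY_RUN are illustrative names, not part of the repository.
# $1: patient directory, $2: the prefix referred to as ${opt} above.
run_mpboot() {
    fasta="${1}/${2}_for_MPBoot.fa"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it, to check paths first.
        echo "mpboot -s ${fasta} -bb 1000"
        return 0
    fi
    if [ ! -f "${fasta}" ]; then
        echo "missing input: ${fasta}" >&2
        return 1
    fi
    mpboot -s "${fasta}" -bb 1000
}
```

A possible use is `for patient in patientA patientB; do run_mpboot "$patient" snp; done`, where the patient names and `snp` are placeholders for your own directory names and prefix.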
Reconstructed phylogenetic trees in .tree and .csv format (with number of mutations on each branch) can be found at data/phylogenetic_trees/. The trees can then be visualised by treeplots.R.
This part of the analysis generated the tree plots displayed in Fig.2, Fig.3 and Extended Data Fig.3.
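Because branch lengths encode mutation counts, the total number of mutations placed on a tree can be checked by summing its branch lengths. The helper below is a rough sketch that assumes the .tree files are in Newick format, with each branch length written after a colon; the function name `total_branch_length` is hypothetical.

```shell
# Sketch only: assumes a Newick file in which every branch length follows a ':'.
total_branch_length() {
    # Extract each ":<number>" token, drop the colon, and sum the numbers.
    grep -oE ':[0-9]+(\.[0-9]+)?' "$1" | tr -d ':' | awk '{ s += $1 } END { print s + 0 }'
}
```

The result can be compared against the number of mutations recorded for the same patient in the corresponding .csv file.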
We only kept branches with > 50 mutations during the run, and the input data can be found at data/mutational_matrices/.
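The > 50 mutation threshold can be illustrated on a mutation count matrix. This is a minimal sketch and not the authors' code: it assumes a comma-separated file with a header line, a branch identifier in the first column, and one count per mutation channel in the remaining columns. The actual layout of the matrices may differ, and the function name `filter_branches` is hypothetical.

```shell
# Sketch only: assumes CSV rows of the form "branch_id,count_1,count_2,...".
# $1: input CSV, $2: threshold (defaults to 50).
filter_branches() {
    awk -F',' -v min="${2:-50}" '
        NR == 1 { print; next }              # keep the header line
        {
            total = 0
            for (i = 2; i <= NF; i++) total += $i
            if (total > min) print           # strictly more than the threshold
        }
    ' "$1"
}
```

A branch with exactly 50 mutations is dropped, matching the strict inequality stated above.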
Workflow and code are in the directory Signatures/. This part of the analysis generated Extended Data Fig.9.
The input file is at data/Extended_Data_Table3_crypt_summary.csv
Workflow and code are in the directory Mutation_burden/.
This part of the analysis generated the plots displayed in Fig.1, Extended Data Fig.4 and Extended Data Fig.8.
The code is in the directory Kataegis/ and the input vcf files are at data/vcf/.
This part of the analysis generated the plots displayed in Fig.4.
The code is in the directory Expression/. Instructions on how to download the input dataset are included in the script.
This analysis generated statistics in Table 1.
Others/stem_cell/ contains code and input files for simulating stem cell dynamics.
Others/VAF.R: generates VAF distribution plots for all samples (Extended Data Fig.1).
Others/APOBEC_motif_enrichment.R: post-processing code for P-MACD results (Extended Data Fig.7c). We ran P-MACD to extract context frequencies.