YichenWang1/small_bowel

APOBEC mutagenesis is a common process in normal human small intestine. Wang et al.

Scripts used to reproduce the analyses of the manuscript:

Wang, Y., Robinson, P.S., Coorens, T.H.H. et al. APOBEC mutagenesis is a common process in normal human small intestine. Nat Genet 55, 246–254 (2023). https://doi.org/10.1038/s41588-022-01296-5

Please contact Yichen (yw2@sanger.ac.uk) if you have any questions or enquiries.

Dataset

Raw sequencing data can be accessed from the European Genome-phenome Archive (EGA) with the accession code EGAD00001008764.

Input data for all figures can be found under the /data directory.

data/somatic_mutations/snp/ contains the final SNPs placed on phylogenetic branches.

data/somatic_mutations/indel/ contains the final INDELs placed on phylogenetic branches.

data/mutation_matrices/ contains the final SBS and ID matrices for the cohort.

data/signatures/ contains the original HDP signatures, their decomposition into PCAWG reference signatures, and the final reference signature exposures for each sample and each phylogenetic branch.

data/phylogenetic_trees/ contains the phylogenetic trees generated by MPBoot using single-base substitutions and indels. The length of the branch represents the number of mutations on the branch.

data/vcf/ contains the final SNP vcf files for finding and plotting kataegis.

data/cancer/ contains the paired cancer data for comparison (mutational burden and mutational signature exposures).

data/motif/ contains enrichment scores for TCN motifs.
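
Because branch lengths in the trees under data/phylogenetic_trees/ are mutation counts, summing them gives the total number of mutations placed on a tree. The Python sketch below illustrates this, assuming a plain Newick string with a numeric branch length after each colon and no quoted labels that contain colons. The function name and the toy tree are hypothetical and are not part of this repository.

```python
import re

def total_branch_length(newick: str) -> float:
    """Sum every branch length (the number following each ':') in a Newick string."""
    # Node labels and support values are not preceded by ':' and are ignored.
    lengths = re.findall(r":\s*([0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?)", newick)
    return sum(float(value) for value in lengths)

# Hypothetical toy tree, not taken from the data directory.
example_tree = "((crypt_A:10,crypt_B:5):3,crypt_C:7);"
total = total_branch_length(example_tree)
print(total)
```

For real trees a dedicated phylogenetics parser is more robust than a regular expression; this only shows how branch length maps to mutation count.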

Variant calling

The final mutation files can be found in extended data tables and data/somatic_mutations/.

Alternatively, they can be generated from the raw sequencing data with the Sanger pipeline (https://github.com/cancerit) using CaVEMan, Pindel, ASCAT and BRASS. When a matched normal sample is available, run all algorithms with that sample as the matched normal. Otherwise, run them unmatched against the synthetic BAM PDv37is.

Filtering

The filters applied to SNVs and indels to exclude LCM artefacts can be found at https://github.com/MathijsSanders/SangerLCMFiltering, and the beta-binomial filter used to exclude germline mutations is at https://github.com/TimCoorens/Unmatched_NormSeq.

Phylogenetic tree reconstruction

Phylogenetic trees were constructed using MPBoot, with supplementary code in Phylogeny/.

Phylogeny/filtering.R contains the beta-binomial filter from the previous step and generates the *for_MPBoot.fa input file for MPBoot. Then run:

mpboot -s $patient/${opt}_for_MPBoot.fa -bb 1000
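
To run this for several patients, the command can be assembled programmatically. The Python sketch below assumes one directory per patient holding the FASTA file written by Phylogeny/filtering.R. The helper names, the patient ID and the prefix are placeholders, and only the -s and -bb options from the command above are used.

```python
import subprocess
from pathlib import Path

def build_mpboot_command(patient: str, opt: str, bootstraps: int = 1000) -> list:
    """Return the argument list for one MPBoot run, mirroring the command above."""
    fasta = Path(patient) / f"{opt}_for_MPBoot.fa"
    return ["mpboot", "-s", fasta.as_posix(), "-bb", str(bootstraps)]

def run_mpboot(patient: str, opt: str, bootstraps: int = 1000) -> None:
    """Run MPBoot for one patient; requires mpboot to be installed and on PATH."""
    subprocess.run(build_mpboot_command(patient, opt, bootstraps), check=True)

# Placeholder patient ID and prefix, for illustration only.
command = build_mpboot_command("patient_01", "snp")
print(" ".join(command))
```

Building the argument list separately from running it makes the command easy to inspect or log before MPBoot is launched.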

Reconstructed phylogenetic trees in .tree and .csv format (with the number of mutations on each branch) can be found in data/phylogenetic_trees/. The trees can then be visualised with treeplots.R.

This part of the analysis generated the tree plots displayed in Fig.2, Fig.3 and Extended Data Fig.3.

Mutational signature extraction

Only branches with more than 50 mutations were kept for signature extraction. The input data can be found at data/mutational_matrices/.
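
The branch filter amounts to a row-sum threshold on the mutation count matrix. The Python sketch below assumes a matrix with one row per branch and one column per mutation channel. That layout, the function name and the toy counts are illustrative assumptions; the authors' own workflow is in the Signatures directory.

```python
def keep_informative_branches(counts: dict, min_mutations: int = 50) -> dict:
    """Keep branches whose total mutation count is strictly greater than min_mutations."""
    return {
        branch: row
        for branch, row in counts.items()
        if sum(row) > min_mutations
    }

# Toy counts (hypothetical): branch name -> counts per mutation channel.
toy_counts = {
    "branch_1": [30, 25, 10],  # total 65, kept
    "branch_2": [20, 20, 10],  # total 50, dropped (not strictly above 50)
    "branch_3": [5, 3, 1],     # total 9, dropped
}
kept = keep_informative_branches(toy_counts)
print(sorted(kept))
```

The comparison is strict, so a branch with exactly 50 mutations is excluded, matching the "> 50" criterion.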

Workflow and code are in the directory /Signatures. This part of the analysis generated Extended Data Fig.9.

Mutation burden analysis

The input file is at data/Extended_Data_Table3_crypt_summary.csv

Workflow and code are in the directory Mutation_burden/.

This part of the analysis generated the plots displayed in Fig.1, Extended Data Fig.4 and Extended Data Fig.8.

Local hypermutation (kataegis) analysis

The code is in the directory Kataegis/ and the input vcf files are at data/vcf/.
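
Kataegis is usually examined through the distance between consecutive mutations on the same chromosome, as in a rainfall plot. The Python sketch below computes those distances from VCF-style records. It is a generic illustration rather than the method implemented in Kataegis/, and the function name and toy records are hypothetical.

```python
from collections import defaultdict

def intermutation_distances(vcf_lines):
    """Return, per chromosome, the distances between consecutive mutation positions."""
    positions = defaultdict(list)
    for line in vcf_lines:
        # Skip blank lines and VCF header lines.
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos = line.split("\t")[:2]
        positions[chrom].append(int(pos))

    distances = {}
    for chrom, pos_list in positions.items():
        pos_list.sort()
        distances[chrom] = [b - a for a, b in zip(pos_list, pos_list[1:])]
    return distances

# Toy records (hypothetical), tab-separated as in a VCF body.
toy_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t1000\t.\tC\tT",
    "1\t1150\t.\tC\tG",
    "1\t900000\t.\tC\tT",
    "2\t500\t.\tG\tA",
]
distances = intermutation_distances(toy_vcf)
print(distances)
```

Runs of unusually short distances on one chromosome are the candidates for localized hypermutation; any threshold for calling a cluster is left to the analysis code.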

This part of the analysis generated the plots displayed in Fig.4.

Single cell RNA-seq of small and large intestine

The code is in the directory Expression/; instructions for downloading the input dataset are included in the script.

This analysis generated statistics in Table 1.

Others

Others/stem_cell/ contains code and input files for simulating stem cell dynamics.

Others/VAF.R generates VAF distribution plots for all samples (Extended Data Fig.1).

Others/APOBEC_motif_enrichment.R: We ran P-MACD to extract context frequencies; this script post-processes the P-MACD results (Extended Data Fig.7c).
