SDPR_paper

We cannot release the data used in the analysis because access to the UK Biobank genotype and phenotype data is restricted. However, if you have access to UK Biobank, you can follow the instructions below to reproduce the results in our paper. The paths to software and datasets in these scripts are specific to our environment and will likely need to be changed. We use Slurm to schedule jobs on our HPC cluster, so you need to change the script headers if you use another scheduler. If you have problems with these scripts, please report them on the issue page.
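For readers unfamiliar with Slurm, the header in question is the block of `#SBATCH` lines at the top of each script. The following is a minimal sketch, not a copy of any script in this repository: the job name and resource values are placeholders, and the mapping of array task IDs to chromosomes is an assumption based on the `--array=1-22` submissions used later.

```shell
#!/bin/bash
#SBATCH --job-name=example      # placeholder name
#SBATCH --time=01:00:00         # placeholder wall-time limit
#SBATCH --mem=8G                # placeholder memory request
#SBATCH --cpus-per-task=1
#SBATCH --array=1-22            # one task per chromosome (assumed mapping)

# Slurm sets SLURM_ARRAY_TASK_ID for each array task.
# Default to 1 so the script can also be run outside Slurm for testing.
chr=${SLURM_ARRAY_TASK_ID:-1}
echo "processing chromosome ${chr}"
```

On another scheduler, the `#SBATCH` lines and the task ID variable are the parts that need to be translated.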

Simulations

Requirements

PRS-CS (Python 2.7, numpy, scipy; we recommend installing Anaconda)
GCTB
GCTA
LDpred
PLINK-1.90
PLINK-2.0
R
about 340 GB of disk space
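Before starting, it can help to confirm that the required tools are on your `PATH`. This is a minimal sketch; the helper name `missing_tools` is hypothetical, and the executable names in the usage example are assumptions that may differ on your system.

```shell
# Print each command that cannot be found on PATH.
# Returns non-zero if at least one is missing.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}
```

For example, `missing_tools plink plink2 gcta64 gctb Rscript python` would list whichever of those assumed executable names are not installed.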

Workflow

1. Obtaining Genotype data

We provide the list of SNPs used in our simulations. The sample list used in the simulations can be obtained upon request, so that you can extract the exact genotype data we used if you have access to UK Biobank.

git clone https://github.com/eldronzhou/SDPR_paper.git
cd SDPR_paper/UKB_simulate/genotype
cd discover/10K 
sbatch plink_merge.sh # you need to change the path to the original UK Biobank genotype

# repeat the above procedure for discover/50K, discover/100K, validate/ and test/
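The repetition over the five genotype subsets can be scripted. This is a minimal sketch that assumes each subdirectory contains its own `plink_merge.sh`, as the instructions above imply. The helper name `submit_genotype_merges` is hypothetical. By default it only prints what it would submit; pass `sbatch` as the second argument to actually submit.

```shell
# Visit each genotype subset and run (or just print) the merge submission.
# $1: path to UKB_simulate/genotype
# $2: command used to submit; defaults to echo for a dry run
submit_genotype_merges() {
    root=$1
    runner=${2:-echo}
    for d in discover/10K discover/50K discover/100K validate test; do
        if [ -d "$root/$d" ]; then
            # Subshell so the caller's working directory is unchanged
            ( cd "$root/$d" && "$runner" plink_merge.sh )
        else
            echo "missing directory: $root/$d" >&2
        fi
    done
}
```

Remember that the path to the original UK Biobank genotype still has to be edited in each `plink_merge.sh` before submitting.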

2. Simulating phenotype and generating summary statistics

Next, run the simulation script to generate summary statistics. You need to submit the script from the UKB_simulate/ directory.

# for Scene1A
sbatch GCTA_sim.sh

# change the j in GCTA_sim.sh and resubmit
# for Scene1B, Scene1C, Scene4 and Scene5

3. Constructing the reference LD matrix

# download reference of PRS_CS
cd ref/PRS_CS; sh get_ref.sh 

# download reference of SDPR
cd ../SDPR; sh get_ref.sh 

# estimate ref of gctb for each chromosome
cd ../gctb; sbatch --array=1-22 gctb.sh

4. Running the analysis

We use Scene1A as an example. You can repeat the same procedure for Scene1B, Scene1C and Scene4.

cd result/Scene1A/h2_0.5/10K/

# SBayesR
cd gctb/; sbatch --array=1-10 gctb.sh

# PRS-CS
cd ../PRS_CS/; sbatch --array=1-22 PRS_CS.sh
# after all jobs finish
sbatch --array=1-10 PRS_CS_res.sh

# LDpred
cd ../ldpred/; sbatch --array=1-10 ldpred.sh

# P+T
cd ../P+T/; sbatch --array=1-10 clumping.sh

# SDPR
cd ../SDPR/; sbatch --array=1-10 SDPR.sh

# repeat the above procedures for 50K and 100K

# after all jobs have finished, make the plot
# under the directory of UKB_simulate/result/Scene1A/h2_0.5/
Rscript get_Res.R

# LDpred2
cd ../ldpred2/; sbatch --array=1-10%4 ldpred2.sh

# lassosum
cd ../LASSOSUM/; sbatch --array=1-10 lassosum.sh

# DBSLMM
cd ../DBSLMM/; sbatch --array=1-10 dbslmm.sh
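The PRS-CS step above submits a second script only after all array jobs finish. With Slurm, that wait can be expressed as a job dependency instead of being done by hand. This is a minimal sketch using the standard `sbatch` options `--parsable` and `--dependency=afterok`. The helper name `submit_after_array` and the `SBATCH` override variable are hypothetical and not part of this repository.

```shell
# Submit an array job, then a follow-up job that starts only after
# every task of the array has completed successfully.
# $1: first script   $2: its array range (e.g. 1-22)
# $3: second script  $4: optional array range for the second script
submit_after_array() {
    first_script=$1
    first_array=$2
    second_script=$3
    second_array=${4:-}
    sbatch_cmd=${SBATCH:-sbatch}   # override SBATCH for a dry run

    # --parsable prints the job ID, optionally followed by ";cluster"
    job_id=$("$sbatch_cmd" --parsable --array="$first_array" "$first_script") || return 1
    job_id=${job_id%%;*}

    if [ -n "$second_array" ]; then
        "$sbatch_cmd" --dependency=afterok:"$job_id" --array="$second_array" "$second_script"
    else
        "$sbatch_cmd" --dependency=afterok:"$job_id" "$second_script"
    fi
}
```

From the PRS_CS/ directory, `submit_after_array PRS_CS.sh 1-22 PRS_CS_res.sh 1-10` would mirror the two manual submissions. If any array task fails, the follow-up job will not start under `afterok`, so check the queue before assuming results are complete.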

Real data applications

Requirements

Same requirements as the simulations, except that PLINK-2.0 and GCTA are not needed
LDSC
SDPR (installed)
about 150 GB of disk space
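The disk space requirement can be checked up front. This is a minimal sketch using POSIX `df`; the helper name `has_free_space` is hypothetical.

```shell
# Succeed if the filesystem holding directory $1 has at least $2 GB free.
has_free_space() {
    dir=$1
    need_gb=$2
    # df -P gives a stable one-line-per-filesystem format;
    # with -k, column 4 is the available space in 1024-byte blocks.
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    need_kb=$((need_gb * 1024 * 1024))
    [ "$avail_kb" -ge "$need_kb" ]
}
```

For example, `has_free_space . 150 || echo "not enough free space here"` checks the current directory against the figure listed above.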

Workflow

The sample list for UKB_real and the list of 5000 reference individuals for SBayesR can be obtained upon request.

1. Obtaining Genotype data

cd UKB_real/genotype; sbatch plink_merge.sh

# 5000 UKB individuals for ref matrix of SBayesR
cd ref_5000/; sbatch plink_merge.sh

2. Constructing the reference LD matrix

cd UKB_real/ref

# get genotype of 1000G EUR
cd 1000G/; sh get_ref.sh

# get reference of PRS_CS
cd ../PRS_CS/; sh get_ref.sh

# estimate reference for SBayesR
cd ../GCTB/; sbatch --array=1-22 gctb.sh

# estimate reference for SDPR
cd ../SDPR/; sbatch --array=1-22 SDPR_ref.sh

3. Obtaining and cleaning the summary statistics

Many GWAS consortia publish summary statistics. However, due to data access agreements, we cannot directly provide the original copies. You can find the source study for each set of summary statistics in Table 1 of the manuscript and download them yourself. If you need assistance, feel free to open an issue. Here we provide an example of downloading and processing the height summary statistics from the GIANT consortium. The procedure is similar for other traits.

cd UKB_real/HGT/summ_stats/

# download
wget https://portals.broadinstitute.org/collaboration/giant/images/0/01/GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt.gz

# make sure the name of the summary statistics file is correct
# you also need to change the paths to LDSC, munge_sumstats.py, and the LDSC reference
Rscript clean1.R
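Before running the cleaning script, it is worth confirming that the download is intact and looking at its header line. This is a minimal sketch; the helper name `peek_sumstats` is hypothetical and is not part of this repository.

```shell
# Verify a gzipped summary statistics file and print its first lines.
# $1: path to the .gz file   $2: number of lines to show (default 3)
peek_sumstats() {
    f=$1
    n=${2:-3}
    [ -f "$f" ] || { echo "not found: $f" >&2; return 1; }
    # gzip -t checks the compressed data without writing any output
    gzip -t "$f" || { echo "corrupt gzip file: $f" >&2; return 1; }
    gzip -dc "$f" | head -n "$n"
}
```

For the height example, `peek_sumstats GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt.gz` shows the column names, which you can compare against what `clean1.R` expects.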

4. Running the analysis

Here is an example of running the analysis for height. The procedure is similar for other traits.

cd UKB_real/HGT/result/

# SBayesR
cd gctb/; sbatch gctb.sh

# PRS-CS
cd ../PRS_CS/; sbatch --array=1-22 PRS_CS.sh
# after all jobs finish
sbatch PRS_CS_res.sh

# LDpred
cd ../ldpred/; sbatch ldpred.sh

# P+T
cd ../P+T/; sbatch clumping.sh

# SDPR
cd ../SDPR/; sbatch --array=1-22 SDPR.sh
# after all jobs finish
sbatch SDPR_res.sh

# LDpred2
cd ../ldpred2/; sbatch ldpred2.sh

# lassosum
cd ../LASSOSUM/; sbatch lassosum.sh

# DBSLMM
cd ../DBSLMM/; sbatch dbslmm.sh

We provide the script to make the figure under the directory UKB_real/HGT/result/predict, although the input file is not available because of the restricted access to the UK Biobank phenotype information.
