Tourette Syndrome Candidate Gene Analysis (TSCGA)

This repository holds the script that gathers information on mutations and genes from multiple database files and writes it to an output file.

Running the program

The program is run on the command line as follows: `python Cand_Gene_MutGene_Pipeline.py --parameter_1 <file_1> --parameter_2 <file_2>`, and so on for each parameter.
A complete example command is given in the section Running the test files.

Available parameters

--files/-f - This reads in the mutation files. The files can be listed explicitly or, if they are the only items in a folder, passed as --files path/files/*. This parameter is required.
--output/-o - This reads in the directory to which all output files are written. This parameter is required.
--anno/-a - This reads in the annotated vcf file to get annovar information on the mutations.
--gtex/-g - This reads in the GTEx median tpm file.
--disease/-d - This reads in up to 2 files. The first file has the gene names and their disease associations, and the second file is a list of keywords used to filter the diseases.
--segpat/-s - This reads in the reflist files used to generate the family segregation patterns. Multiple families can be read in the same way as with the --files parameter.
--genetol/-t - This reads in the gene tolerance score file.
--genes/-e - This reads in sets of 2. The first element is a newline delimited file of gene names and the second element is the column header that is added to the mutation file.
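
The parameter list above could be declared with argparse roughly as follows. This is a minimal sketch, not the parser used in Cand_Gene_MutGene_Pipeline.py: the function names build_parser and pair_gene_args are hypothetical, and the choice to collect --genes as a flat list and pair it up afterwards is an assumption based on the "sets of 2" description.

```python
import argparse


def build_parser():
    """Declare the documented parameters (sketch; the real parser may differ)."""
    parser = argparse.ArgumentParser(description="Candidate gene analysis (sketch)")
    parser.add_argument("--files", "-f", nargs="+", required=True,
                        help="one or more mutation files")
    parser.add_argument("--output", "-o", required=True,
                        help="directory for the output files")
    parser.add_argument("--anno", "-a", help="annotated vcf file")
    parser.add_argument("--gtex", "-g", help="GTEx median tpm file")
    parser.add_argument("--disease", "-d", nargs="+",
                        help="disease association file, then optional keyword file")
    parser.add_argument("--segpat", "-s", nargs="+", help="reflist files")
    parser.add_argument("--genetol", "-t", help="gene tolerance score file")
    parser.add_argument("--genes", "-e", nargs="+",
                        help="pairs of: gene list file, column header")
    return parser


def pair_gene_args(values):
    """Turn [file_1, header_1, file_2, header_2, ...] into (file, header) pairs."""
    if len(values) % 2 != 0:
        raise ValueError("--genes expects pairs of: gene file, column header")
    return list(zip(values[0::2], values[1::2]))


args = build_parser().parse_args(
    ["-f", "a.txt", "b.txt", "-o", "out", "-e", "gwas.txt", "GWAS", "other.txt", "Other"]
)
gene_pairs = pair_gene_args(args.genes)
print(gene_pairs)
```

Because the shell expands path/files/* before the script starts, nargs="+" receives the expanded file names as ordinary arguments.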

Aside from the two required parameters, all parameters are optional and can be used in any combination.
Which columns must be present in the input files given to --files depends on the parameters used:

--anno requires the columns: Chr, VarType, and VarPosition
--segpat requires the file name to follow a specific pattern (see details below) in addition to the column: Genotype
All other parameters require the column: Gene
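
The column requirements above can be checked before any work is done. The sketch below is an illustration only: the names REQUIRED_COLUMNS and missing_columns are hypothetical, and parameters are identified by their long option names as an assumption.

```python
# Columns each parameter needs in the mutation file, as listed above.
# Any parameter not named here is assumed to need only the Gene column.
REQUIRED_COLUMNS = {
    "anno": {"Chr", "VarType", "VarPosition"},
    "segpat": {"Genotype"},
}


def missing_columns(header, active_params):
    """Return {parameter: sorted list of columns absent from the header}."""
    present = set(header)
    missing = {}
    for param in active_params:
        needed = REQUIRED_COLUMNS.get(param, {"Gene"})
        lacking = needed - present
        if lacking:
            missing[param] = sorted(lacking)
    return missing


header = ["Chr", "VarPosition", "Gene"]
problems = missing_columns(header, ["anno", "gtex", "segpat"])
print(problems)
```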

More information on parameters can be found below.

Types of input files

Mutation file `--files/-f`

This is a tab delimited file that can contain any of the following information:

Mutation position (defined in the Chr and VarPosition columns)
Type of mutation: ins, del, snp, or mnp (defined in the VarType column)
Gene that the mutation is found in (defined in the Gene column)
Segregation pattern of the mutation if running pedigree analysis (defined in the Genotype column)

Multiple mutation files can be run at once, either by listing them explicitly on the command line or by passing a folder of mutation files with the folder/path/* notation.
If pedigree analysis is being run, mutation files are named by the family and the inheritance pattern. Ex. Family1_Dominant.txt

Database Files

Reflist files `--segpat/-s`

This input is for running pedigree analysis.
Reflist files contain the family segregation information, which is compared against the segregation of each mutation.
Multiple families can be run at once by reading in the reflist files in the same manner as the mutation files.
The program pairs mutation files with reflist files by file name, so the family name in a reflist file name must match the family name in the corresponding mutation file name.
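
One way to pair files by family name is sketched below. It assumes the family name is the part of the file name before the first underscore or dot (as in Family1_Dominant.txt); the actual script may extract it differently, and family_name and match_reflists are hypothetical helper names.

```python
import os


def family_name(path):
    """Assumed convention: family name precedes the first '_' or '.'."""
    stem = os.path.basename(path).split(".")[0]
    return stem.split("_")[0]


def match_reflists(mutation_files, reflist_files):
    """Map each mutation file to the reflist of the same family, or None."""
    by_family = {family_name(path): path for path in reflist_files}
    return {path: by_family.get(family_name(path)) for path in mutation_files}


pairs = match_reflists(
    ["data/TS0096_Dominant_test.tsv", "data/Family1_Dominant.txt"],
    ["refs/TS0096.reflist"],
)
print(pairs)
```

A mutation file that maps to None has no reflist for its family, which is the situation the naming requirement is meant to prevent.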

Annotation file `--anno/-a`

This input is an annotated vcf file. The mutation position from the mutation file is matched to the corresponding line in the annotation file, and the information is taken from that line.
The program takes the following information from the annotated file, where available:

Sift score
Polyphen-2 HVAR and HDIV scores
ExAC allele frequency
gnomAD allele frequency
Clinvar information
Interpro_domain information
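
The position matching described above can be pictured as a dictionary lookup. The sketch below indexes the INFO field of each vcf data line by chromosome and position. It is an illustration under assumptions: the INFO key names used in the example (SIFT_score, ExAC_ALL) depend on how the file was annotated, the helper names are hypothetical, and the real script may also use VarType or adjust positions for insertions and deletions.

```python
def parse_info(info_field):
    """Split a vcf INFO string such as 'A=1;B=2;FLAG' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            info[key] = value
        elif entry:
            info[entry] = True  # flag without a value
    return info


def index_annotations(vcf_lines):
    """Map (chromosome, position) -> INFO dict for every data line."""
    index = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta lines, the header line, and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, info = fields[0], fields[1], fields[7]
        index[(chrom, pos)] = parse_info(info)
    return index


# Made-up example line, not real annotation data.
example_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t12345\t.\tA\tG\t50\tPASS\tSIFT_score=0.01;ExAC_ALL=0.0002\n",
]
annotations = index_annotations(example_lines)
print(annotations.get(("1", "12345")))
```

Chromosome names must use the same convention in both files (for example "1" versus "chr1") for the lookup to succeed.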

Gene tolerance score file `--genetol/-t`

This input contains the following information in a tab delimited file:

Gene symbol to match with gene symbol in the mutation file
mis_z score
pLI score
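
Reading such a file into a lookup table might look like the sketch below. The column names mis_z and pLI come from the list above; the gene symbol column name ("gene") and the function name load_gene_tolerance are assumptions, and the example rows are made up.

```python
import csv
import io


def load_gene_tolerance(handle, gene_column="gene"):
    """Map gene symbol -> (mis_z, pLI) from a tab-delimited file handle.

    gene_column is an assumed header name; adjust it to the real file.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[gene_column]: (row["mis_z"], row["pLI"]) for row in reader}


# Made-up example rows, not real scores.
example = io.StringIO("gene\tmis_z\tpLI\nGENE_A\t1.5\t0.99\nGENE_B\t-0.2\t0.01\n")
scores = load_gene_tolerance(example)
print(scores.get("GENE_A", "No Gene Found"))
print(scores.get("GENE_C", "No Gene Found"))
```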

Disease association file and disease key file `--disease/-d`

The disease association file lists all the diseases that have been linked to each gene based on publication text mining. Each line has the format: Gene|disease_1;disease_2;disease_3 etc.
The file has 1 gene per line.

The disease key file is a text file that has one element per line. The elements are grouped under the following headings:

general:
- element_1
- element_2
specific:
- element_1
exclude:
- element_1

The heading tells the program whether to keep diseases that contain the keyword, matched case insensitively (general) or case sensitively (specific), or to exclude diseases that contain the keyword (exclude).
Any combination of these categories is allowed. If this file is not supplied at all, all diseases are printed to the output.
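
The two file formats and the filtering rules above can be sketched as follows. Several details are assumptions rather than documented behavior: keywords are matched as substrings, exclude keywords are matched case insensitively, a leading "- " on a keyword line is optional, and when only exclude keywords are given every non-excluded disease is kept. All function names are hypothetical.

```python
def parse_association_line(line):
    """Split 'Gene|disease_1;disease_2' into (gene, [diseases])."""
    gene, _, diseases = line.strip().partition("|")
    return gene, [d for d in diseases.split(";") if d]


def parse_keyword_file(lines):
    """Group keywords under the general/specific/exclude headings."""
    keywords = {"general": [], "specific": [], "exclude": []}
    current = None
    for raw in lines:
        text = raw.strip()
        if not text:
            continue
        heading = text.rstrip(":").lower()
        if text.endswith(":") and heading in keywords:
            current = heading
        elif current is not None:
            keywords[current].append(text.lstrip("-").strip())
    return keywords


def filter_diseases(diseases, keywords):
    """Keep diseases matching general/specific keywords and not excluded."""
    general = [k.lower() for k in keywords["general"]]
    specific = keywords["specific"]
    exclude = [k.lower() for k in keywords["exclude"]]
    kept = []
    for disease in diseases:
        lowered = disease.lower()
        if any(k in lowered for k in exclude):
            continue
        wanted = (any(k in lowered for k in general)
                  or any(k in disease for k in specific))
        if wanted or not (general or specific):
            kept.append(disease)
    return kept


# Made-up association line and keywords for illustration.
gene, diseases = parse_association_line(
    "GENE_A|Tourette syndrome;ADHD;Breast cancer;adhd-like trait\n"
)
keywords = parse_keyword_file(
    ["general:", "- tourette", "specific:", "- ADHD", "exclude:", "- like"]
)
print(gene, filter_diseases(diseases, keywords))
```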

GTEx tpm file `--gtex/-g`

This file has tpm values for all genes in over 40 tissues, collected by the Genotype-Tissue Expression (GTEx) project. The file is a .gct file and holds the median tpm for each gene in each of the tissues.
The program collects tissue expression levels and outputs the following:

Top tissue expression for:
- Brain tissue
- Non-Brain tissue
- All tissues
Median tissue expression for:
- Brain tissue
- Non-Brain tissue
- All tissues

This gives a quick view of whether the gene is expressed more highly in the brain than in other tissues.
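
For a single gene, the six values listed above could be computed as in the sketch below. It assumes brain tissues can be recognised by a tissue name starting with "Brain" and that "top" means the tissue with the highest median tpm; whether the script reports the tissue name, the value, or both is not stated, so the sketch returns both. The function name and the tpm values are made up.

```python
from statistics import median


def summarize_expression(tpm_by_tissue):
    """Return top tissue, top tpm, and median tpm for Brain, Non-Brain, and All."""
    groups = {
        "Brain": {t: v for t, v in tpm_by_tissue.items() if t.startswith("Brain")},
        "Non-Brain": {t: v for t, v in tpm_by_tissue.items()
                      if not t.startswith("Brain")},
        "All": dict(tpm_by_tissue),
    }
    summary = {}
    for label, values in groups.items():
        if not values:
            summary[label] = None  # no tissues in this group
            continue
        top_tissue = max(values, key=values.get)
        summary[label] = {
            "top_tissue": top_tissue,
            "top_tpm": values[top_tissue],
            "median_tpm": median(values.values()),
        }
    return summary


# Made-up tpm values for one gene.
example_tpm = {
    "Brain - Cortex": 12.0,
    "Brain - Cerebellum": 8.0,
    "Liver": 2.0,
    "Lung": 4.0,
    "Whole Blood": 1.0,
}
summary = summarize_expression(example_tpm)
print(summary["Brain"], summary["Non-Brain"])
```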

Other Gene information files `--genes/-e`

These files are user generated and add additional evidence for the genes.
Each file has one gene name per line, and the program reports whether the gene in the mutation file was found in the given file.
Ex. For the TS project there were genes found in a GWAS study and we wanted to add that information into the mutation files. We put the GWAS genes into a file and the program reported in the mutation file whether each gene was found in that GWAS study.

Any number of these gene files can be supplied to add information on the genes in the mutation file.
For each gene file the program also reads in a string, which is used as the header of the column that reports whether the gene was found in that file.
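
The membership check described above amounts to a set lookup per row. In the sketch below the reported values "Yes" and "No" are assumptions (the text does not say what the script writes), and the helper names and gene names are made up.

```python
def load_gene_set(lines):
    """Read one gene name per line, ignoring blank lines."""
    return {line.strip() for line in lines if line.strip()}


def add_gene_list_column(rows, gene_set, column_header):
    """Add a column reporting whether each row's Gene is in the gene set."""
    for row in rows:
        row[column_header] = "Yes" if row.get("Gene") in gene_set else "No"
    return rows


gwas_genes = load_gene_set(["GENE_A\n", "\n", "GENE_C\n"])
rows = [{"Gene": "GENE_A"}, {"Gene": "GENE_B"}]
annotated = add_gene_list_column(rows, gwas_genes, "GWAS")
print(annotated)
```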

Output files `--output/-o`

The output file is the mutation file with the database information added to the end of each row.
The program adds column headers to the end of the header row and fills in the information for each mutation in those columns.
The program also adds a tag to the end of the file name, before the .txt extension, to let the user know what type of information was added to the file.
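
The tagging scheme can be sketched as below. The tag strings and their order are inferred from the test output file name given under Running the test files; keeping the extension of the input file and the helper name output_name are assumptions.

```python
import os

# (parameter, tag) pairs in the order they appear in the test output name.
TAGS = [
    ("segpat", "segpat"),
    ("anno", "annotated"),
    ("genetol", "gene_tolerance"),
    ("disease", "disease_assoc"),
    ("gtex", "tissue_expression"),
]


def output_name(mutation_file, active_params, gene_headers=()):
    """Build the output file name from the input name and the parameters used."""
    stem, ext = os.path.splitext(os.path.basename(mutation_file))
    tags = [tag for param, tag in TAGS if param in active_params]
    tags.extend(gene_headers)  # each --genes column header is appended as a tag
    return "_".join([stem] + tags) + ext


name = output_name(
    "~/TS0096_Dominant_test.tsv",
    {"segpat", "anno", "genetol", "disease", "gtex"},
    ["Gene_List"],
)
print(name)
```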

Running the test files

The following command runs the test mutation file and gathers information from all of the other test files.
The information in these test files is a mix of real and simulated data, so please do not assume the data in these files are factual.
Not every file has relevant information, or any information, on the genes in the list, so some genes will show the message No Gene Found or a "." in place of a value.
The command assumes that it is run from the directory containing the python script and that the other files are in the user's home directory (~).
`python Cand_Gene_MutGene_Pipeline.py --files ~/TS0096_Dominant_test.tsv --segpat ~/TS0096.reflist --anno ~/annotation_hg19_testTS0096.vcf --genetol ~/GeneTolerance_TS0096test.txt --disease ~/Disease_assoc_TS0096test.txt ~/TS_disease_keywords.txt --genes ~/Gene_List_TS0096test.txt Gene_List --gtex ~/GTEx_median_tpm_TS0096test.gct --output .`

The output of this command can be found in the TS0096_Dominant_test_segpat_annotated_gene_tolerance_disease_assoc_tissue_expression_Gene_List file.
