Kinase activity inference from phosphoproteomics data based on substrate sequence specificity
Current version:
0.10.0
Research paper: https://doi.org/10.1101/2024.03.22.586304
PhosX infers differential kinase activities from phosphoproteomics data without requiring any prior knowledge database of kinase-phosphosite associations. PhosX assigns the detected phosphopeptides to potential upstream kinases based on experimentally determined substrate sequence specificities, and it tests the enrichment of a kinase's potential substrates in the extremes of a ranked list of phosphopeptides using a Kolmogorov-Smirnov-like statistic. A p value for this statistic is extracted empirically by random permutations of the phosphosite ranks.
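As a rough illustration of the idea described above, the following sketch computes a Kolmogorov-Smirnov-like running-sum statistic for one set of putative substrates and estimates an empirical p value by permutation. It is not PhosX's implementation: the function names, the weighting by absolute value, and the two-sided p value are assumptions made for this example.

```python
import numpy as np


def ks_like_statistic(values, is_hit):
    """Signed maximum deviation of a weighted running sum over the ranked list."""
    values = np.asarray(values, dtype=float)
    is_hit = np.asarray(is_hit, dtype=bool)
    order = np.argsort(-values)              # rank from highest to lowest value
    hits = is_hit[order]
    weights = np.abs(values[order]) * hits   # assumption: hits weighted by |value|
    n_miss = int((~hits).sum())
    if weights.sum() == 0 or n_miss == 0:
        return 0.0
    step_hit = weights / weights.sum()       # increments at putative substrates
    step_miss = (~hits) / n_miss             # decrements at all other peptides
    running = np.cumsum(step_hit - step_miss)
    return float(running[np.argmax(np.abs(running))])


def permutation_p_value(values, is_hit, n_permutations=1000, seed=0):
    """Empirical two-sided p value obtained by shuffling the hit labels."""
    rng = np.random.default_rng(seed)
    is_hit = np.asarray(is_hit, dtype=bool)
    observed = ks_like_statistic(values, is_hit)
    null = np.array(
        [ks_like_statistic(values, rng.permutation(is_hit)) for _ in range(n_permutations)]
    )
    p_value = float(np.mean(np.abs(null) >= abs(observed)))
    return observed, p_value
```

Shuffling which peptides count as hits is equivalent to shuffling the ranks of a fixed substrate set, which is how the sketch stands in for the rank permutations mentioned above.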
From PyPI
pip install phosx
From source (requires Poetry)
poetry build
pip install dist/*.whl
PhosX can be used as a command line tool (phosx) with minimal effort. Its output is written to STDOUT by default, making it easy to use in bioinformatics pipelines. Alternatively, the user can specify an output filename (option -o).
Example: run PhosX with default parameters on an example dataset, using up to 8 cores, and redirecting the output table to kinase_activities.tmp:
phosx -c 8 tests/seqrnk/koksal2018_log2.fold.change.8min.seqrnk > kinase_activities.tmp
A full description of the command line options can be viewed with phosx -h:
usage: phosx [-h] [-yp Y_PSSM] [-stp S_T_PSSM] [-yq Y_PSSM_QUANTILES] [-stq S_T_PSSM_QUANTILES] [-n N_PERMUTATIONS] [-stk S_T_N_TOP_KINASES] [-yk Y_N_TOP_KINASES] [-mh MIN_N_HITS] [-mp MIN_QUANTILE] [-c N_PROC] [--plot-figures] [-d OUTPUT_DIR]
[-o OUTPUT_PATH] [-v]
seqrnk
Kinase activity inference from phosphosproteomics data based on substrate sequence specificity
positional arguments:
seqrnk Path to the seqrnk file.
options:
-h, --help show this help message and exit
-yp Y_PSSM, --y-pssm Y_PSSM
Path to the h5 file storing custom Tyr PSSMs; defaults to built-in PSSMs
-stp S_T_PSSM, --s-t-pssm S_T_PSSM
Path to the h5 file storing custom Ser/Thr PSSMs; defaults to built-in PSSMs
-yq Y_PSSM_QUANTILES, --y-pssm-quantiles Y_PSSM_QUANTILES
Path to the h5 file storing custom Tyr kinases PSSM score quantile distributions under the key 'pssm_scores'; defaults to built-in PSSM scores quantiles
-stq S_T_PSSM_QUANTILES, --s-t-pssm-quantiles S_T_PSSM_QUANTILES
Path to the h5 file storing custom Ser/Thr kinases PSSM score quantile distributions under the key 'pssm_scores'; defaults to built-in PSSM scores quantiles
-n N_PERMUTATIONS, --n-permutations N_PERMUTATIONS
Number of random permutations; default: 1000
-stk S_T_N_TOP_KINASES, --s-t-n-top-kinases S_T_N_TOP_KINASES
Number of top-scoring Ser/Thr kinases potentially associatiated to a given phosphosite; default: 5
-yk Y_N_TOP_KINASES, --y-n-top-kinases Y_N_TOP_KINASES
Number of top-scoring Tyr kinases potentially associatiated to a given phosphosite; default: 5
-mh MIN_N_HITS, --min-n-hits MIN_N_HITS
Minimum number of phosphosites associated with a kinase for the kinase to be considered in the analysis; default: 4
-mp MIN_QUANTILE, --min-quantile MIN_QUANTILE
Minimum PSSM score quantile that a phosphosite has to satisfy to be potentially assigned to a kinase; default: 0.95
-c N_PROC, --n-proc N_PROC
Number of cores used for multithreading; default: 1
--plot-figures Save figures in pdf format; see also --output_dir
-d OUTPUT_DIR, --output-dir OUTPUT_DIR
Output files directory; only relevant if used with --plot_figures; defaults to 'phosx_output/'
-o OUTPUT_PATH, --output-path OUTPUT_PATH
Main output table; if not specified it will be printed in STDOUT
-v, --version Print package version and exit
For a full description of the method please see the Method section and the manuscript.
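For use inside a Python pipeline, the command line tool can be called through the standard library. The sketch below only uses options listed in the help text above (-c, -n, -o); the helper names build_phosx_command and run_phosx are hypothetical and not part of the package.

```python
import subprocess


def build_phosx_command(seqrnk_path, output_path, n_proc=1, n_permutations=1000):
    """Assemble the phosx argument list from options documented in `phosx -h`."""
    return [
        "phosx",
        "-c", str(n_proc),            # number of cores
        "-n", str(n_permutations),    # number of random permutations
        "-o", output_path,            # main output table
        seqrnk_path,                  # positional argument: the seqrnk file
    ]


def run_phosx(seqrnk_path, output_path, **options):
    """Run phosx and raise CalledProcessError if it exits with an error."""
    command = build_phosx_command(seqrnk_path, output_path, **options)
    subprocess.run(command, check=True)


# Example call (requires phosx to be installed and on the PATH):
# run_phosx("tests/seqrnk/koksal2018_log2.fold.change.8min.seqrnk",
#           "kinase_activities.tsv", n_proc=8)
```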
PhosX's input format is a simple text file that we name seqrnk. It consists of 2 tab-separated columns containing phosphopeptide sequences and values, respectively. The values should be biologically relevant measures of differential phosphorylation, typically intensity log fold changes as obtained when comparing two conditions in mass spectrometry experiments. Amino acid sequences should be of length 10, as in the example below. Each residue is represented by the corresponding 1-letter symbol according to the IUPAC nomenclature for amino acids, and additional phosphorylated Serine, Threonine or Tyrosine residues are represented with the symbols s, t, and y, respectively. Phosphorylated residues that act as potential priming sites and are therefore not in the
$ head tests/seqrnk/koksal2018_log2.fold.change.8min.seqrnk
QEEAEYVRAL 5.646644
ANFSAYPSEE 4.33437
YLNRNYWEKK 4.174151
AENAEYLRVA 3.685413
STYTSYPKAE 3.491975
SFLQRYSSDP 3.295341
AAEPGSPTAA 3.202242
EPAHAYAQPQ 3.160899
RQKSTYTSYP 3.114077
ETKSLYPSSE 3.04653
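Before running PhosX it can help to check that a file follows this layout. The sketch below parses seqrnk text and rejects malformed lines; the function name parse_seqrnk is hypothetical, and the length-10 check is an assumption based on the example above rather than a rule taken from the PhosX code.

```python
def parse_seqrnk(text, expected_length=10):
    """Parse seqrnk text into a list of (sequence, value) tuples."""
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip empty lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_number}: expected 2 tab-separated columns")
        sequence = fields[0]
        value = float(fields[1])  # raises ValueError if not numeric
        if len(sequence) != expected_length:
            raise ValueError(
                f"line {line_number}: sequence length is {len(sequence)}, "
                f"expected {expected_length}"
            )
        records.append((sequence, value))
    return records
```

For instance, parse_seqrnk(open(path).read()) would raise a ValueError that names the first offending line.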
Alongside the main program, this package also installs make-seqrnk. This utility can be used to help generate a seqrnk file given a list of phosphosites, each one identified by a UniProt Accession Number and residue coordinate. make-seqrnk will query the UniProt database to fetch the appropriate subsequences to build the seqrnk file. See make-seqrnk -h for more details:
usage: make-seqrnk [-h] [-i INPUT] [-o OUTPUT]
Make a seqrnk file to be used to compute differential kinase activity with PhosX
options:
-h, --help show this help message and exit
-i INPUT, --input INPUT
Path of the input phosphosites to be converted in seqrnk format. It should be a TSV file where the 1st column is the UniProtAC (str), the 2nd is the sequence coordinate (int), and the 3rd is the logFC (float); defaults to STDIN
-o OUTPUT, --output OUTPUT
Path of the seqrnk file; if not specified it will be printed in STDOUT
Run an example:
cat tests/p_list/15_3.tsv | make-seqrnk > 15_3.seqrnk
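The input expected by make-seqrnk can be produced from a list of phosphosites with a few lines of Python. This is a minimal sketch: the helper name format_phosphosite_rows is hypothetical, the absence of a header line is an assumption, and the logFC value in the usage note is made up for illustration.

```python
def format_phosphosite_rows(phosphosites):
    """Format (UniProtAC, coordinate, logFC) tuples as TSV text for make-seqrnk."""
    lines = []
    for uniprot_ac, coordinate, log_fc in phosphosites:
        # Columns: UniProtAC (str), sequence coordinate (int), logFC (float)
        lines.append(f"{uniprot_ac}\t{int(coordinate)}\t{float(log_fc)}")
    return "\n".join(lines) + "\n"
```

The returned text, for example format_phosphosite_rows([("P00533", 1197, 1.25)]), can be written to a file and passed with -i, or piped to make-seqrnk through STDIN.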
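PhosX estimates the affinity between human kinases and phosphopeptides based on the substrate sequence specificity encoded in Position Specific Scoring Matrices (PSSMs). A kinase PSSM is a matrix with one row per substrate position and one column per residue symbol (10 x 23 in the layout read by the function below).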
PhosX comes with built-in, default PSSMs for human kinases, which can be found at phosx/data/*_PSSMs.h5. The user can also run PhosX using custom PSSMs, whose path can be specified with the options -yp and -stp.
For convenience, here is a function that can be used to open and inspect the structure of the Hierarchical Data Format version 5 (HDF5) files where the PSSMs are stored:
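import h5py
import pandas as pd

# Residue symbols labelling the PSSM columns (s, t, y: phosphorylated S, T, Y)
AA_LIST = [
    "G","P","A","V","L","I",
    "M","C","F","Y","W","H",
    "K","R","Q","N","E","D",
    "S","T","s","t","y",
]
# Substrate positions labelling the PSSM rows
POSITIONS_LIST = list(range(-5, 5))

def read_pssms(pssms_h5_file: str):
    """Return a dict mapping each kinase name to its PSSM as a DataFrame."""
    pssm_df_dict = {}
    # Each top-level key of the HDF5 file is a kinase name
    with h5py.File(pssms_h5_file, "r") as pssms_h5:
        for kinase in pssms_h5.keys():
            pssm_df = pd.DataFrame(pssms_h5[kinase])
            pssm_df.columns = AA_LIST
            pssm_df.index = POSITIONS_LIST
            pssm_df_dict[kinase] = pssm_df
    return pssm_df_dict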
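To make the row and column layout concrete, the sketch below looks up the PSSM entry matched by each residue of a 10-residue sequence. It only illustrates the indexing implied by AA_LIST and POSITIONS_LIST; how PhosX combines these entries into a score is described in the Method section and is not reproduced here. The function name pssm_entries_for_sequence is hypothetical.

```python
import numpy as np

AA_LIST = [
    "G", "P", "A", "V", "L", "I",
    "M", "C", "F", "Y", "W", "H",
    "K", "R", "Q", "N", "E", "D",
    "S", "T", "s", "t", "y",
]
POSITIONS_LIST = list(range(-5, 5))


def pssm_entries_for_sequence(pssm, sequence):
    """Return the PSSM entry selected by each residue, one per position."""
    pssm = np.asarray(pssm, dtype=float)
    if pssm.shape != (len(POSITIONS_LIST), len(AA_LIST)):
        raise ValueError("PSSM must have shape (10, 23)")
    if len(sequence) != len(POSITIONS_LIST):
        raise ValueError("sequence must have 10 residues")
    # AA_LIST.index raises ValueError for symbols that are not PSSM columns
    columns = [AA_LIST.index(residue) for residue in sequence]
    rows = np.arange(len(sequence))  # row i corresponds to POSITIONS_LIST[i]
    return pssm[rows, columns]
```

With the dictionary returned by read_pssms, a call could look like pssm_entries_for_sequence(pssms[kinase].to_numpy(), "QEEAEYVRAL"), where kinase is any key of that dictionary.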
Similarly, PhosX also has built-in kinase PSSM score quantile distributions computed on a reference human phosphoproteome from the PhosphoSitePlus database. These can be found at phosx/data/*_PSSM_score_quantiles.h5. When supplying custom PSSMs, it is necessary to also specify the appropriate background distributions with the options -yq and -stq. Inspect the HDF5 files containing the background PSSM scores with:
import pandas as pd

def read_pssm_score_quantiles(pssm_score_quantiles_h5_file: str):
    # The quantile table is stored under the key 'pssm_scores'
    pssm_bg_scores_df = pd.read_hdf(pssm_score_quantiles_h5_file, key="pssm_scores")
    return pssm_bg_scores_df
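The options -mp, -stk and -yk control how phosphosites are linked to kinases through these quantiles. One possible reading of the help text is sketched below: for a single phosphosite, keep the kinases whose score quantile reaches the minimum, then retain the top-scoring ones. The function name, the kinase labels and the exact order in which the two filters are applied are assumptions, not taken from the PhosX source.

```python
def assign_kinases(score_quantiles, n_top_kinases=5, min_quantile=0.95):
    """Pick candidate upstream kinases for one phosphosite.

    score_quantiles maps a kinase name to the quantile of the phosphosite's
    PSSM score in that kinase's background distribution.
    """
    eligible = [
        (kinase, quantile)
        for kinase, quantile in score_quantiles.items()
        if quantile >= min_quantile  # analogous to --min-quantile
    ]
    eligible.sort(key=lambda item: item[1], reverse=True)
    # Keep at most n_top_kinases (analogous to -stk / -yk)
    return [kinase for kinase, _ in eligible[:n_top_kinases]]
```

For example, with quantiles {"KIN_A": 0.99, "KIN_B": 0.96, "KIN_C": 0.50} and n_top_kinases=1, only "KIN_A" would be retained.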
PhosX's main output is a text file reporting the computed kinase activities with associated statistics as described in the Method section. For each kinase, the KS statistic, the p value, the FDR q value, and the Activity Score are reported. Note that kinases for which an Activity Score could not be computed (for example, for lack of matching phosphopeptides in the input data) are omitted in the current version. See an output example from the command executed above:
$ head kinase_activities.tmp
KS p value FDR q value Activity Score
AAK1 -0.2476131530554456 0.533 1.0 -0.2732727909734277
ACVR1B -0.36580307230946174 0.078 1.0 -1.1079053973095196
ACVR2A 0.2259224236806207 0.439 1.0 0.35753547975787864
ACVR2B 0.47014516632215597 0.019 1.0 1.7212463990471711
ALK2 -0.25190558854944195 0.276 1.0 -0.5590909179347823
ALPHAK3 -0.2875279855211264 0.358 1.0 -0.44611697335612566
ALPK3 0.5759513630398431 0.041 1.0 1.3872161432802645
AMPKA2 0.4401107718873606 0.299 1.0 0.5243288116755703
ATM -0.49337471491068263 0.096 1.0 -1.0177287669604316
Additionally, PhosX can save plots of the weighted running sum and of the KS statistic compared to its empirical null distribution, similar to the ones shown above, for each kinase. To enable this behavior, the option --plot-figures must be specified. A custom directory to save the plots can be passed with -d.
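The output table can be post-processed with the standard library alone. The sketch below parses the table and ranks kinases by Activity Score; it assumes the file is tab-separated with the kinase name in an unlabeled first column, as the example output suggests, and the function names are hypothetical.

```python
import csv
import io


def read_kinase_activities(text):
    """Parse the PhosX output table into {kinase: {column name: value}}."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    # Drop the empty header cell that may sit above the kinase name column
    header = [name for name in rows[0] if name]
    table = {}
    for fields in rows[1:]:
        if not fields:
            continue  # skip empty lines
        kinase, values = fields[0], fields[1:]
        table[kinase] = dict(zip(header, map(float, values)))
    return table


def rank_by_activity(table):
    """Return kinase names sorted from highest to lowest Activity Score."""
    return sorted(table, key=lambda kinase: table[kinase]["Activity Score"], reverse=True)
```

Reading the file produced by the example command, for instance with read_kinase_activities(open("kinase_activities.tmp").read()), gives a dictionary that rank_by_activity can sort.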
For each kinase PSSM, a score is assigned to each phosphopeptide sequence
where
PhosX uses the PSSM scores to link kinases to their potential substrates. Each phosphopeptide is assigned as potential target to its
where
The kinase enrichment score (
For each kinase, PhosX computes an empirical p value of the
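The activity score for a given kinase is defined as:
$$\text{Activity Score} = -\log_{10}(p) \cdot \texttt{sign}(\text{KS})$$
where $\texttt{sign}$ is the sign function, and $p$ and KS are the kinase's empirical p value and KS statistic as reported in the output table. For example, AAK1 in the output above has $p = 0.533$ and a negative KS, giving an Activity Score of about $-0.273$.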
Please cite one of the following references if you use PhosX in your work.
BibTeX:
@article{Lussana2024,
title = {PhosX: data-driven kinase activity inference from phosphoproteomics experiments},
url = {http://dx.doi.org/10.1101/2024.03.22.586304},
DOI = {10.1101/2024.03.22.586304},
publisher = {Cold Spring Harbor Laboratory},
author = {Lussana, Alessandro and Petsalaki, Evangelia},
year = {2024},
month = mar
}
BibTeX:
@article{10.1093/bioinformatics/btae697,
author = {Lussana, Alessandro and Müller-Dott, Sophia and Saez-Rodriguez, Julio and Petsalaki, Evangelia},
title = {PhosX: data-driven kinase activity inference from phosphoproteomics experiments},
journal = {Bioinformatics},
volume = {40},
number = {12},
pages = {btae697},
year = {2024},
month = {11},
abstract = {The inference of kinase activity from phosphoproteomics data can point to causal mechanisms driving signalling processes and potential drug targets. Identifying the kinases whose change in activity explains the observed phosphorylation profiles, however, remains challenging, and constrained by the manually curated knowledge of kinase–substrate associations. Recently, experimentally determined substrate sequence specificities of human kinases have become available, but robust methods to exploit this new data for kinase activity inference are still missing. We present PhosX, a method to estimate differential kinase activity from phosphoproteomics data that combines state-of-the-art statistics in enrichment analysis with kinases’ substrate sequence specificity information. Using a large phosphoproteomics dataset with known differentially regulated kinases we show that our method identifies upregulated and downregulated kinases by only relying on the input phosphopeptides’ sequences and intensity changes. We find that PhosX outperforms the currently available approach for the same task, and performs better or similarly to state-of-the-art methods that rely on previously known kinase–substrate associations. We therefore recommend its use for data-driven kinase activity inference. PhosX is implemented in Python, open-source under the Apache-2.0 licence, and distributed on the Python Package Index. The code is available on GitHub (https://github.com/alussana/phosx).},
issn = {1367-4811},
doi = {10.1093/bioinformatics/btae697},
url = {https://doi.org/10.1093/bioinformatics/btae697},
eprint = {https://academic.oup.com/bioinformatics/article-pdf/40/12/btae697/60972735/btae697.pdf},
}