3Cnet: pathogenicity prediction of human variants using multitask learning with evolutionary constraints

Update logs

Mar 11, 2024

  • A sample trained model has been added to the repository (3Cnet/MT_models/36.pt).

    • You can test the model without GPU resources; the CPU is used automatically when no GPU is available.
    • Training a new model may require GPU resources.
  • To test the model:

$ python model_evaluator.py -e 36
  • To evaluate your own variants, you may need to add a file of variants written in HGVSp notation and edit omegaconf.yaml.
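
For readers unfamiliar with HGVSp notation, the sketch below shows one way to sanity-check protein-level variant strings before placing them in an input file. It is an illustration only: the regular expression covers just simple missense substitutions in three-letter notation, the function name `parse_missense_hgvsp` is hypothetical, and the file layout that 3Cnet expects is not described here, so compare against the provided *_hgvsps.tsv files for the actual format.

```python
import re

# Assumption: a simple missense HGVSp string looks like "NP_000537.3:p.Arg175His".
# Other variant types (frameshift, deletion, etc.) are NOT handled by this sketch.
_MISSENSE = re.compile(
    r"^(?P<protein>NP_\d+(?:\.\d+)?):p\."
    r"(?P<ref>[A-Z][a-z]{2})(?P<pos>\d+)(?P<alt>[A-Z][a-z]{2})$"
)


def parse_missense_hgvsp(hgvsp):
    """Return (protein_id, ref_aa, position, alt_aa), or None if no match."""
    match = _MISSENSE.match(hgvsp.strip())
    if match is None:
        return None
    return (match["protein"], match["ref"], int(match["pos"]), match["alt"])
```

Strings that return None are not necessarily invalid; they are simply outside the narrow pattern this sketch recognizes.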

Feb 7, 2024

  • Released 3Cnet version 2.0
    • Please see: https://zenodo.org/records/10212255
    • Major changes
      • 3Cnet v2 no longer depends on SNVBOX features.
      • Almost all types of in-exon variants can be inferred (see neuron/constants.py).
      • Better performance than 3Cnet v1 (ROC-AUC improved from 91% to 93% on the external ClinVar test set).
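
ROC-AUC, the metric quoted above, is the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign one. The following pure-Python sketch computes it by pairwise comparison. The function name `roc_auc` and the toy labels and scores are illustrative and are not taken from 3Cnet's evaluation code.

```python
def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison (ties count as half a win).

    labels: sequence of 0 (benign) / 1 (pathogenic).
    scores: predicted scores, higher meaning more likely pathogenic.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, made up for illustration only.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
print(roc_auc(toy_labels, toy_scores))  # 0.75
```

The pairwise loop is quadratic in the number of variants, so for large test sets a rank-based implementation from an established library is the more practical choice.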

Feb 7, 2022

May 7, 2021

  • Initial release of 3Cnet

Installation

3Cnet ver.2 was trained using the following versions of software:

We recommend you have at least 40GB of free storage.
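
If you want to check the available space before cloning and downloading, a small standard-library sketch such as the one below can help. The helper name `has_enough_space` is hypothetical, and reading "40GB" as 40 * 10^9 bytes is an assumption.

```python
import shutil

# Assumption: "40GB" is interpreted as 40 * 10^9 bytes.
REQUIRED_BYTES = 40 * 10**9


def has_enough_space(path=".", required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem containing `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    print("enough free space:", has_enough_space())
```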

STEP 1: Clone the 3Cnet repository

$ git clone https://github.com/KyoungYeulLee/3Cnet.git

STEP 2: Set up environment

We assume that you are running our model on one or more NVIDIA GPUs.

Option 1: Use Docker (recommended)

Install Docker and nvidia-container-toolkit

Docker Engine (we use Docker 20.10.9)

https://docs.docker.com/engine/install/

NVIDIA/container-toolkit (to use NVIDIA GPUs)

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

If you don't usually have access to root/sudo, consider Docker Rootless

https://docs.docker.com/engine/security/rootless/

Build the 3Cnet Docker image

$ sudo docker build -t 3billion/3cnet:v2.0.0 .

Run docker image interactively

$ sudo docker run --gpus all -it -v $(pwd):/workspace 3billion/3cnet:v2.0.0 bash
$ cd /workspace

Option 2: Install using pip

$ pip install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html

STEP 3: Run download_data.py to retrieve necessary files from Zenodo

(uses requests and tqdm)

$ python download_data.py

TODO

Code execution (continuing from data download)

  1. To train 3Cnet
$ python model_trainer.py -s model_name
  2. To evaluate 3Cnet performance
$ python model_evaluator.py -m model_name -e 30 -s test_result

Note that you need to choose an appropriate epoch number for -e (30 in the example above).


Data and file structures

  • download_data.py: Retrieves the data/ directory from Zenodo.
  • model_trainer.py: Top-level script for 3Cnet training. Outputs include the model parameters, a training log, and a backup of the config.
  • model_evaluator.py: Evaluates a model using trained model parameters. The test result is saved in the model directory (pred.tsv).
  • omegaconf.yaml: Configuration file for training and evaluation; edit it when evaluating your own variants.

neuron

  • aa_to_int_mappings.py: Mapping from amino-acid strings to integer representations.
  • constants.py: Definitions of the variant types used in this project.
  • errors.py: Error definitions.
  • seq_database.py: Parses sequence information from the data.
  • seq_collection.py: Defines the collection of sequence objects.
  • sequences.py: Defines the sequence object.
  • featurizer.py: Converts sequence objects into trainable features.
  • utils.py: Utility functions.
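
As an illustration of what an amino-acid-to-integer mapping can look like, here is a minimal sketch. The alphabet order, the index values, and the names `AA_TO_INT` and `encode` are assumptions made for this example; the actual mapping lives in aa_to_int_mappings.py and may differ.

```python
# Assumption: the 20 standard amino acids in alphabetical one-letter order,
# with 0 reserved for unknown symbols. The real 3Cnet mapping may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INT = {aa: i for i, aa in enumerate(AMINO_ACIDS, start=1)}
UNKNOWN = 0


def encode(sequence):
    """Convert an amino-acid string into a list of integer codes."""
    return [AA_TO_INT.get(aa, UNKNOWN) for aa in sequence.upper()]


print(encode("MKV"))  # [11, 9, 18]
```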

cccnet

  • dataset_builder.py: Class that builds a PyTorch dataset from HGVSp-formatted files.
  • torch_dataset.py: Dataset class definition for 3Cnet.
  • torch_network.py: The 3Cnet architecture is defined here (nn.Module).
  • deep_utils.py: Utility script for deep learning.
  • utils.py: Utility script.

data

  • reference_sequences.tsv: File containing each sequence ID and its amino-acid sequence.

  • msa_arrays/: NP_*.npy files giving, for each residue, the conservation proportions of 21 amino acids.

  • train_hgvsps/

    • train_clinvar_hgvsps.tsv: pathogenic-or-benign-labeled variants from ClinVar
    • train_gnomad_hgvsps.tsv: benign-labeled variants from gnomAD
    • train_conservation_hgvsps.tsv: pathogenic-like and benign-like variants inferred from conservation data
  • test_hgvsps/: Contains data pertaining to the external ClinVar test set and the in-house patient data.

    • test_clinvar_missense_hgvsps.tsv: Variants from external ClinVar (missense variants)
    • test_clinvar_non-missense_hgvsps.tsv: Variants from external ClinVar (non-missense variants)
    • test_inhouse_hgvsps.tsv: In-house patient variants (missense variants)
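
To make the idea of per-residue conservation proportions concrete, the sketch below derives them from a toy multiple sequence alignment in pure Python. The 21-symbol alphabet (20 standard amino acids plus a gap character), the function name `conservation_proportions`, and the toy alignment are assumptions for illustration; how the NP_*.npy arrays were actually generated is not described here.

```python
from collections import Counter

# Assumption: 21 symbols = 20 standard amino acids plus "-" for gaps.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def conservation_proportions(alignment):
    """Return one row per alignment column; each row holds the proportion
    of every symbol in ALPHABET among the aligned sequences."""
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    rows = []
    for column in zip(*alignment):
        counts = Counter(column)
        # Symbols outside ALPHABET are ignored, so a row may sum to less than 1.
        rows.append([counts.get(sym, 0) / len(alignment) for sym in ALPHABET])
    return rows


toy_alignment = ["MKV", "MRV", "MK-", "MKV"]  # made-up example
matrix = conservation_proportions(toy_alignment)
print(len(matrix), len(matrix[0]))  # 3 21
```

Each row corresponds to one residue position and each column to one symbol, which mirrors the "one array per protein" layout suggested by the NP_*.npy file names.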
