CVC is a language model trained on T-cell receptor (TCR) CDR3 sequences, built on a lightly modified BERT architecture with adjusted pre-training objectives. The model produces meaningful embeddings that can be used for downstream tasks such as classification.
scCVC is an updated version of CVC trained on single-cell TCR sequences (each single cell represented by its CDR3 sequences). It uses the same architecture as CVC and produces meaningful embeddings both for single sequences and for single cells.
To install CVC/scCVC:
- Clone the GitHub repository and create its conda environment as follows. Make sure you use a recent conda version (4.10 or above):
conda env create -n my_env_name --file=environment.yml
conda activate my_env_name
- Download the model into the project base directory. The model is shared via a Google Drive link and can be downloaded using 'gdown'.
# install gdown
pip install --upgrade --no-cache-dir gdown
# run script to download model
# CVC
python -m scripts.download_cvc --model_type CVC
# scCVC
python -m scripts.download_cvc --model_type scCVC
- To create embeddings:
a. Open the notebook lab_notebooks/create_embeddings.ipynb
b. Edit the cells under "Specify Parameters" with the relative paths and the model
c. Run the notebook
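Conceptually, a BERT-style encoder emits one hidden vector per amino-acid token, and a single fixed-size sequence embedding is obtained by pooling those vectors. The sketch below illustrates masked mean pooling with stand-in NumPy arrays; it is not the repository's actual code, and the toy shapes (and pooling choice) are assumptions.

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average per-token hidden states into one embedding per sequence,
    ignoring padding positions marked 0 in the attention mask."""
    mask = attention_mask[..., np.newaxis].astype(hidden_states.dtype)  # (batch, tokens, 1)
    summed = (hidden_states * mask).sum(axis=1)                         # (batch, hidden)
    counts = mask.sum(axis=1)                                           # (batch, 1)
    return summed / counts

# Toy example: 2 CDR3 sequences, up to 4 tokens, hidden size 8 (real models are larger)
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 8))
mask = np.array([[1, 1, 1, 0],   # sequence 1: 3 real tokens + 1 pad
                 [1, 1, 1, 1]])  # sequence 2: all 4 positions used
emb = mean_pool(hidden, mask)
print(emb.shape)  # (2, 8): one fixed-size embedding per sequence
```

Each row of `emb` can then be fed to a downstream classifier or plotted, which is what the notebook does with the real model's outputs.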
The data used to train each model are shared via a Google Drive link and can be downloaded using the following commands:
# install gdown
pip install --upgrade --no-cache-dir gdown
# run script to download data
# CVC
python -m scripts.download_cvc_training_data --model_type CVC
# scCVC
python -m scripts.download_cvc_training_data --model_type scCVC
To train the model on your own set of sequences, pass the data file's path via the '--data_path' flag (CVC) or the '--pathdata' flag (scCVC).
# CVC
# train CVC with default dataset
python3 bin/train_cvc.py --epochs 50 --bs 1024 --noneptune --datasets CUSTOM_DATASET --config ./model_configs/bert_defaults.json --outdir ./output_dir/
# train CVC with custom dataset
python3 bin/train_cvc.py --epochs 50 --bs 1024 --noneptune --datasets CUSTOM_DATASET --config ./model_configs/bert_defaults.json --outdir ./output_dir/ --data_path PATH_TO_CSV
# scCVC
# train scCVC model with default dataset
python -m bin.train_sc_cvc --epochs 50 --bs 128 --noneptune --pathdata ./scDATA_ready_for_training.csv --config ./model_configs/bert_defaults.json --outdir ./output_dir/
# train scCVC model with custom dataset
# use the preprocess_scData_for_training.ipynb notebook to preprocess your data
python -m bin.train_sc_cvc --epochs 50 --bs 128 --noneptune --pathdata PATH_TO_CSV --config ./model_configs/bert_defaults.json --outdir ./output_dir/
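A custom dataset is supplied as a CSV of CDR3 sequences. The snippet below is only an illustration of producing such a file; the exact column name expected by bin/train_cvc.py is not documented here, so "CDR3" is a placeholder — check the header of the default dataset and match it.

```python
import csv

# Hypothetical custom training data; "CDR3" is a placeholder column name,
# not a documented requirement of the training script.
sequences = ["CASSLGTDTQYF", "CASSPGQGNYEQYF", "CASRRGSSYEQYF"]
with open("my_tcr_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["CDR3"])
    for seq in sequences:
        writer.writerow([seq])
```

The resulting file would then be passed as PATH_TO_CSV via '--data_path' (or preprocessed with the single-cell notebook first and passed via '--pathdata').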
# Test CVC
python -m tests.test_create_embeddings
The main notebooks used in the paper are in the lab_notebooks and single_cell_research folders.
The lab_notebooks folder contains notebooks used to create the embeddings (mostly using CVC), analyze TCR data, and run several downstream tasks.
Some of the most useful notebooks are:
- lab_notebooks/create_embeddings.ipynb is used to create the embeddings (using either CVC or scCVC) for a given dataset.
- lab_notebooks/binary_classifiers.ipynb is used to run a binary classification (Public/Private or MAIT) on a given dataset.
- lab_notebooks/model_train_test_data_creation.ipynb is used to create the train/test data for training CVC or other classification tasks.
- lab_notebooks/Private_Public_labeling.ipynb is used to label the data as Public/Private.
- lab_notebooks/plot_embeddings_MAIT.ipynb is used to label the data as MAIT and analyze it.
The rest of the notebooks can be used to re-create the plots displayed in the paper, or to create new plots from the given data.
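The binary classification notebooks fit a classifier on top of the embeddings; conceptually, any standard classifier applied to the embedding matrix works. Below is a minimal nearest-centroid sketch on synthetic data standing in for CVC embeddings with Public/Private labels — the notebook's actual classifier and data may differ.

```python
import numpy as np

def fit_centroids(X, y):
    """Compute one centroid per class from labeled embeddings."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict(X, centroids):
    """Assign each embedding to the class of the nearest centroid."""
    labels = list(centroids)
    dists = np.stack([np.linalg.norm(X - centroids[l], axis=1) for l in labels])
    return np.array(labels)[dists.argmin(axis=0)]

# Synthetic, well-separated stand-ins for embeddings of two classes
rng = np.random.default_rng(1)
X_public  = rng.normal(loc=0.0, size=(20, 8))
X_private = rng.normal(loc=3.0, size=(20, 8))
X = np.vstack([X_public, X_private])
y = np.array(["Public"] * 20 + ["Private"] * 20)

centroids = fit_centroids(X, y)
preds = predict(X, centroids)
print((preds == y).mean())  # training accuracy on the toy data
```

In practice one would use a held-out test split (see model_train_test_data_creation.ipynb) rather than evaluating on the training data as this toy does.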
The single_cell_research folder contains notebooks for single-cell data analysis and for creating embeddings with scCVC.
- single_cell_research/preprocess_scData_for_training.ipynb is used to preprocess the single cell data for training scCVC.
- single_cell_research/embeddings_sc_data.ipynb is used to create and plot the embeddings for single cells (concatenated representation).
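The "concatenated representation" mentioned above joins a cell's per-chain CDR3 embeddings into one cell vector. The sketch below illustrates the idea with stand-in vectors; treating each cell as an alpha/beta chain pair is an assumption for illustration, not the repository's code.

```python
import numpy as np

def cell_embedding(chain_embeddings):
    """Concatenate the per-chain CDR3 embeddings of one cell into a single
    fixed-size cell vector (assumes a fixed number of chains per cell)."""
    return np.concatenate(chain_embeddings)

# Stand-in 8-dim embeddings for one cell's two CDR3 chains
rng = np.random.default_rng(2)
alpha = rng.normal(size=8)
beta = rng.normal(size=8)
cell = cell_embedding([alpha, beta])
print(cell.shape)  # (16,): the two chain embeddings concatenated
```

Because concatenation preserves chain order, cells must list their chains consistently for the resulting vectors to be comparable.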