CVC is a language model trained on T-cell receptor (TCR) CDR3 sequences, built on a lightly modified BERT architecture with adjusted pre-training objectives. The model produces meaningful embeddings that can be used for downstream tasks such as classification.
scCVC is an updated version of CVC trained on single-cell TCR sequences (each single cell represented by its CDR3 sequences). It uses the same architecture as CVC and produces meaningful embeddings both for single sequences and for single cells.
To install CVC/scCVC:
- Clone the GitHub repository and create its conda environment as follows. Make sure you use a recent conda version (4.10 or above):
conda env create -n my_env_name --file=environment.yml
conda activate my_env_name
- Download the model into the project base directory. The model is shared via a Google Drive link and can be downloaded using 'gdown'.
# install gdown
pip install --upgrade --no-cache-dir gdown
# run script to download model
# CVC
python -m scripts.download_cvc --model_type CVC
# scCVC
python -m scripts.download_cvc --model_type scCVC
- To create embeddings:
a. Open the notebook lab_notebooks/create_embeddings.ipynb
b. Edit the cells under "Specify Parameters" with the relative paths and the model
c. Run the notebook
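Conceptually, a BERT-style encoder emits one hidden vector per amino-acid token, and a single fixed-size sequence embedding is obtained by pooling those vectors. The sketch below illustrates masked mean pooling with stand-in NumPy arrays; it is not the repository's actual code, and the toy shapes (and pooling choice) are assumptions.

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average per-token hidden states into one embedding per sequence,
    ignoring padding positions marked 0 in the attention mask."""
    mask = attention_mask[..., np.newaxis].astype(hidden_states.dtype)  # (batch, tokens, 1)
    summed = (hidden_states * mask).sum(axis=1)                         # (batch, hidden)
    counts = mask.sum(axis=1)                                           # (batch, 1)
    return summed / counts

# Toy example: 2 CDR3 sequences, up to 4 tokens, hidden size 8 (real models are larger)
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 8))
mask = np.array([[1, 1, 1, 0],   # sequence 1: 3 real tokens + 1 pad
                 [1, 1, 1, 1]])  # sequence 2: all 4 positions used
emb = mean_pool(hidden, mask)
print(emb.shape)  # (2, 8): one fixed-size embedding per sequence
```

Each row of `emb` can then be fed to a downstream classifier or plotted, which is what the notebook does with the real model's outputs.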
The data used to train each model are shared via a Google Drive link and can be downloaded using the following commands:
# install gdown
pip install --upgrade --no-cache-dir gdown
# run script to download data
# CVC
python -m scripts.download_cvc_training_data --model_type CVC
# scCVC
python -m scripts.download_cvc_training_data --model_type scCVC
To train the model on your own set of sequences, pass the data file's path via the '--data_path' flag (CVC) or the '--pathdata' flag (scCVC).
# CVC
# train CVC with default dataset
python3 bin/train_cvc.py --epochs 50 --bs 1024 --noneptune --datasets CUSTOM_DATASET --config ./model_configs/bert_defaults.json --outdir ./output_dir/
# train CVC with custom dataset
python3 bin/train_cvc.py --epochs 50 --bs 1024 --noneptune --datasets CUSTOM_DATASET --config ./model_configs/bert_defaults.json --outdir ./output_dir/ --data_path PATH_TO_CSV
# scCVC
# train scCVC model with default dataset
python -m bin.train_sc_cvc --epochs 50 --bs 128 --noneptune --pathdata ./scDATA_ready_for_training.csv --config ./model_configs/bert_defaults.json --outdir ./output_dir/
# train scCVC model with custom dataset
# use the preprocess_scData_for_training.ipynb notebook to preprocess your data
python -m bin.train_sc_cvc --epochs 50 --bs 128 --noneptune --pathdata PATH_TO_CSV --config ./model_configs/bert_defaults.json --outdir ./output_dir/
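A custom dataset is supplied as a CSV of CDR3 sequences. The snippet below is only an illustration of producing such a file; the exact column name expected by bin/train_cvc.py is not documented here, so "CDR3" is a placeholder — check the header of the default dataset and match it.

```python
import csv

# Hypothetical custom training data; "CDR3" is a placeholder column name,
# not a documented requirement of the training script.
sequences = ["CASSLGTDTQYF", "CASSPGQGNYEQYF", "CASRRGSSYEQYF"]
with open("my_tcr_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["CDR3"])
    for seq in sequences:
        writer.writerow([seq])
```

The resulting file would then be passed as PATH_TO_CSV via '--data_path' (or preprocessed with the single-cell notebook first and passed via '--pathdata').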
# Test CVC
python -m tests.test_create_embeddings
The main notebooks used in the paper are in the lab_notebooks and single_cell_research folders.
The lab_notebooks folder contains notebooks used to create the embeddings (mostly using CVC), analyze TCR data, and run several downstream tasks.
Some of the most useful notebooks are:
- lab_notebooks/create_embeddings.ipynb is used to create the embeddings (using either CVC or scCVC) for a given dataset.
- lab_notebooks/binary_classifiers.ipynb is used to run a binary classification (Public/Private or MAIT) on a given dataset.
- lab_notebooks/model_train_test_data_creation.ipynb is used to create the train/test data for training CVC or other classification tasks.
- lab_notebooks/Private_Public_labeling.ipynb is used to label the data as Public/Private.
- lab_notebooks/plot_embeddings_MAIT.ipynb is used to label the data as MAIT and analyze it.
The rest of the notebooks can be used to re-create the plots displayed in the paper, or to create new plots from the given data.
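The binary classification notebooks fit a classifier on top of the embeddings; conceptually, any standard classifier applied to the embedding matrix works. Below is a minimal nearest-centroid sketch on synthetic data standing in for CVC embeddings with Public/Private labels — the notebook's actual classifier and data may differ.

```python
import numpy as np

def fit_centroids(X, y):
    """Compute one centroid per class from labeled embeddings."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict(X, centroids):
    """Assign each embedding to the class of the nearest centroid."""
    labels = list(centroids)
    dists = np.stack([np.linalg.norm(X - centroids[l], axis=1) for l in labels])
    return np.array(labels)[dists.argmin(axis=0)]

# Synthetic, well-separated stand-ins for embeddings of two classes
rng = np.random.default_rng(1)
X_public  = rng.normal(loc=0.0, size=(20, 8))
X_private = rng.normal(loc=3.0, size=(20, 8))
X = np.vstack([X_public, X_private])
y = np.array(["Public"] * 20 + ["Private"] * 20)

centroids = fit_centroids(X, y)
preds = predict(X, centroids)
print((preds == y).mean())  # training accuracy on the toy data
```

In practice one would use a held-out test split (see model_train_test_data_creation.ipynb) rather than evaluating on the training data as this toy does.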
The single_cell_research folder contains notebooks for single-cell data analysis and for creating embeddings with scCVC.
- single_cell_research/preprocess_scData_for_training.ipynb is used to preprocess the single cell data for training scCVC.
- single_cell_research/embeddings_sc_data.ipynb is used to create and plot the embeddings for single cells (concatenated representation).
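The "concatenated representation" mentioned above joins a cell's per-chain CDR3 embeddings into one cell vector. The sketch below illustrates the idea with stand-in vectors; treating each cell as an alpha/beta chain pair is an assumption for illustration, not the repository's code.

```python
import numpy as np

def cell_embedding(chain_embeddings):
    """Concatenate the per-chain CDR3 embeddings of one cell into a single
    fixed-size cell vector (assumes a fixed number of chains per cell)."""
    return np.concatenate(chain_embeddings)

# Stand-in 8-dim embeddings for one cell's two CDR3 chains
rng = np.random.default_rng(2)
alpha = rng.normal(size=8)
beta = rng.normal(size=8)
cell = cell_embedding([alpha, beta])
print(cell.shape)  # (16,): the two chain embeddings concatenated
```

Because concatenation preserves chain order, cells must list their chains consistently for the resulting vectors to be comparable.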