This is the official implementation of the paper "Latent-based Directed Evolution for Protein Sequence Design".
Our repository is structured as follows:
.
├── active_optimize.sh # inference + active learning
├── environment.yml
├── exps # experiment results
├── optimize.sh # inference
├── preprocessed_data
├── README.md
├── scripts # main executable scripts
├── src
│ ├── common # common utilities
│ ├── dataio # dataloader
│ └── models
├── train.sh # training script
└── visualize_latent.sh # visualize trained latent
You should have Python 3.10 or higher. We highly recommend creating a virtual environment, for example with conda. To create one from the provided environment file, run:
conda env create -f environment.yml
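If you are not using the conda environment, a quick way to confirm the interpreter meets the stated Python 3.10 requirement is the sketch below. The helper name `meets_minimum` is hypothetical and is not part of this repository.

```python
import sys

MINIMUM = (3, 10)  # minimum version stated in this README


def meets_minimum(version_info=sys.version_info, minimum=MINIMUM):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if meets_minimum():
    print("Python version is sufficient.")
else:
    print("Python 3.10 or higher is required.")
```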
Download the oracle landscape models with the following commands (using the script provided here):
cd scripts
bash download_landscape.sh
To train the VAE model for a benchmark dataset, go to the root directory and execute the train.sh file. Taking avGFP as an example, run the following command:
bash train.sh ./scripts/configs/rnn_template.py 0 template avGFP 20 256
Checkpoints will be saved in the exps/ckpts/ folder. Details of the passed arguments can be found here.
To perform optimization, go to the root directory and execute the optimize.sh file. Taking avGFP as an example, run the following command:
bash optimize.sh avGFP 0 template <model_ckpt_path> <oracle_ckpt_path> 1 rnn
Similarly, to perform active learning alongside optimization, see the details of the passed arguments in the active_optimize.sh file.
Results will be saved in the exps/results_no_active and exps/results folders.
To average the results over 5 seeds, check calculate.py.
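As an illustration of what averaging over seeds involves, here is a minimal sketch. It is not the implementation in calculate.py: the function name `summarize_over_seeds` is hypothetical, and the scores are placeholder values rather than results from the paper. It assumes you have already collected one scalar metric per seed.

```python
from statistics import mean, stdev


def summarize_over_seeds(scores):
    """Return (mean, sample standard deviation) of one metric across seeds."""
    if len(scores) < 2:
        raise ValueError("need results from at least two seeds")
    return mean(scores), stdev(scores)


# Placeholder values (one score per seed), not results from the paper.
placeholder_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
avg, std = summarize_over_seeds(placeholder_scores)
print(f"mean={avg:.3f} std={std:.3f}")
```

Reporting the standard deviation alongside the mean shows how much the metric varies between seeds.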
If you find our work useful for your research, please cite:
@article{10.1088/2632-2153/adc2e2,
author={Tran, Thanh and Ngo, Nhat Khang and Nguyen, Viet Thanh Duy and Hy, Truong-Son},
title={LatentDE: Latent-based Directed Evolution for Protein Sequence Design},
journal={Machine Learning: Science and Technology},
url={http://iopscience.iop.org/article/10.1088/2632-2153/adc2e2},
year={2025},
abstract={Directed evolution has been the most effective method for protein engineering that optimizes biological functionalities through a resource-intensive process of screening or selecting among a vast range of mutations. To mitigate this extensive procedure, recent advancements in machine learning-guided methodologies center around the establishment of a surrogate sequence-function model. In this paper, we propose Latent-based Directed Evolution (LDE), an evolutionary algorithm designed to prioritize the exploration of high-fitness mutants in the latent space. At its core, LDE is a regularized variational autoencoder (VAE), harnessing the capabilities of the state-of-the-art Protein Language Model (pLM), ESM-2, to construct a meaningful latent space of sequences. From this encoded representation, we present a novel approach for efficient traversal on the fitness landscape, employing a combination of gradient-based methods and directed evolution. Experimental evaluations conducted on eight protein sequence design tasks demonstrate the superior performance of our proposed LDE over previous baseline algorithms.}
}