
Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing

[CVPR 2025] Official implementation of the paper "Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing"


🌼 Environment

We use Python 3.8.18 and CUDA 11.8; other compatible versions may also work. Both training and inference are implemented in PyTorch and run on a GeForce RTX 4090 GPU.

conda create -n dubbing python=3.8.18
conda activate dubbing
pip install -r requirements.txt
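
After installing the requirements, a quick sanity check confirms that PyTorch sees the GPU. This is a minimal sketch; it only assumes PyTorch was installed via requirements.txt:

import torch

# Verify the PyTorch build, the CUDA version it was built against, and GPU visibility.
print("PyTorch:", torch.__version__)
print("CUDA build:", torch.version.cuda)        # expected: 11.8
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))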

🔧 Training

For First Stage (Acoustic Pre-training)

python train_first.py -p Configs/config_v2c_stage1.yml  # V2C-Animation benchmark
python train_first.py -p Configs/config_grid_stage1.yml  # GRID benchmark

For Second Stage (Prosody Adapting)

python train_second.py -p Configs/config_v2c.yml  # V2C-Animation benchmark
python train_second_grid.py -p Configs/config_grid.yml  # GRID benchmark
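
The two stages run sequentially: the second stage adapts prosody on top of the acoustic model pre-trained in the first stage. A minimal sketch of the full V2C-Animation pipeline (the script and config names are taken from the commands above; everything else is illustrative):

import subprocess

# Stage 1: prosody-enhanced acoustic pre-training on the V2C-Animation benchmark.
subprocess.run(
    ["python", "train_first.py", "-p", "Configs/config_v2c_stage1.yml"],
    check=True,
)

# Stage 2: acoustic-disentangled prosody adapting, building on the stage-1 model.
subprocess.run(
    ["python", "train_second.py", "-p", "Configs/config_v2c.yml"],
    check=True,
)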

💡 Checkpoints

We provide pre-trained checkpoints for both stages on the V2C-Animation and GRID benchmarks:

First stage (for second-stage training only; cannot directly generate waveforms)

Second stage (can be used to directly generate waveforms)
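
To verify a downloaded checkpoint before training or inference, a minimal inspection sketch (the file name is hypothetical, and the stored keys depend on the actual checkpoint format):

import torch

# Load on CPU so inspection does not require a GPU.
ckpt = torch.load("second_stage_v2c.pth", map_location="cpu")  # hypothetical file name

# List the top-level entries (e.g. model weights, optimizer state, epoch).
for key in ckpt:
    print(key)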

✍ Inference

For V2C-Animation Benchmark

There are three generation settings in the V2C-Animation benchmark:

python inference_v2c.py -n 'YOUR_EXP_NAME' --epoch 'YOUR_EPOCH' --setting 1
python inference_v2c.py -n 'YOUR_EXP_NAME' --epoch 'YOUR_EPOCH' --setting 2
python inference_v2c.py -n 'YOUR_EXP_NAME' --epoch 'YOUR_EPOCH' --setting 3
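
To generate results under all three settings in one run, a minimal sketch looping over the commands above (the experiment name and epoch are placeholders for your own run):

import subprocess

EXP_NAME = "YOUR_EXP_NAME"   # placeholder: your experiment name
EPOCH = "YOUR_EPOCH"         # placeholder: the checkpoint epoch to use

for setting in (1, 2, 3):
    subprocess.run(
        ["python", "inference_v2c.py",
         "-n", EXP_NAME, "--epoch", EPOCH, "--setting", str(setting)],
        check=True,
    )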

For GRID benchmark

There are two generation settings in the GRID benchmark:

python inference_grid.py -n 'YOUR_EXP_NAME' --epoch 'YOUR_EPOCH' --setting 1
python inference_grid.py -n 'YOUR_EXP_NAME' --epoch 'YOUR_EPOCH' --setting 2

📊 Dataset

🙏 Acknowledgments

We would like to thank the authors of previous related projects for generously sharing their code and insights: StyleTTS, StyleTTS2, StyleDubber, PL-BERT, and HiFi-GAN.

🤝 Citation

If you find our work useful, please consider citing:

@misc{zhang2025produbber,
      title={Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing}, 
      author={Zhedong Zhang and Liang Li and Chenggang Yan and Chunshan Liu and Anton van den Hengel and Yuankai Qi},
      year={2025},
      eprint={2503.12042},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2503.12042}, 
}
