Out-of-memory during fine-tuning #59
Repo author mentioned in issue #48 that a
Please check the colab demo: https://github.com/yl4579/StyleTTS2/blob/main/Colab/StyleTTS2_Finetune_Demo.ipynb. You can finetune with only a batch size of 2, but try not to reduce
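For context, the batch size the demo refers to is the batch_size field in the fine-tuning config. A minimal sketch of applying only that change programmatically (the file path and workflow here are assumed from the repo layout, not taken from the demo):

```python
# Minimal sketch, not from the Colab demo: lower only batch_size in the
# fine-tuning config and leave the remaining fields untouched.
import yaml

config_path = "Configs/config_ft.yml"  # path assumed from the repo layout
with open(config_path) as f:
    cfg = yaml.safe_load(f)

cfg["batch_size"] = 2  # the value the demo reportedly fine-tunes with

with open(config_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```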
@yl4579 Thank you for contributing such great work. BTW, why does StyleTTS-2 require so much GPU memory? I tried to finetune it on an A800 (80GB) and the only change I made was to set the batch size to 4, which requires nearly 68GB by epoch 15. At the beginning of fine-tuning, it seems to only update Is this normal?
@cnlinxi It doesn't update the parameters of WavLM, but it does use its gradient to train the generator. This is unfortunately one of the limitations of using large speech language models; future work can probably resolve it. You can also skip the joint training part, but that will significantly worsen the quality, as we discussed earlier in this thread.
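To illustrate this point about memory, here is an editorial sketch with toy modules (not the repo's actual training loop): freezing the SLM's parameters does not avoid storing its activations, because the adversarial loss is still backpropagated through the frozen SLM into the generator.

```python
import torch
import torch.nn as nn

# Toy stand-ins: a "generator" and a frozen SLM-style discriminator.
class TinySLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(80, 512), nn.ReLU(), nn.Linear(512, 1))
    def forward(self, x):
        return self.net(x)

generator = nn.Linear(128, 80)        # stand-in for the TTS generator
slm = TinySLM()
for p in slm.parameters():
    p.requires_grad = False           # frozen: its weights never update

z = torch.randn(4, 128)
fake_mel = generator(z)
adv_loss = -slm(fake_mel).mean()      # loss is computed through the frozen SLM
adv_loss.backward()                   # SLM activations must be kept for this backward pass

print(any(p.grad is not None for p in slm.parameters()))  # False: SLM is not updated
print(generator.weight.grad is not None)                  # True: generator still gets gradients
```

With a real WavLM-sized model, those stored activations are what drive the VRAM usage up even though its weights are frozen.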
I tried using the bitsandbytes 8-bit optimizer, but I may have done something wrong (I don't know much about it); I ended up with the same VRAM usage and slower speed.
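For reference, the usual way to swap in the bitsandbytes 8-bit optimizer is roughly the following (a sketch, not the repo's optimizer setup). Note that it only shrinks optimizer state, not the activation memory that dominates the joint-training stage, which would be consistent with seeing no VRAM savings.

```python
import torch.nn as nn
import bitsandbytes as bnb

# Sketch only: replace torch.optim.AdamW with its 8-bit counterpart.
# This reduces optimizer-state memory (exp_avg / exp_avg_sq), not activations.
model = nn.Linear(512, 512)
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4, betas=(0.9, 0.99))
```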
I'm trying to run fine-tuning on a dataset of roughly 1.5 hours of audio with an average clip length of ~7.5 seconds. My hardware is four GeForce RTX 3090 GPUs with 24 GB of VRAM each. Unfortunately, training always crashes with an out-of-memory (OOM) error after the first couple of steps. I've checked both the README and the discussion in issue #10, but none of the suggested changes seem to help. Even values such as
max_len: 50
batch_percentage: 0.125
do not work. I'm running the fine-tuning command suggested in the README, where config_ft.yml is the same as before but updated with the above values and my own dataset. Any other suggestions on what I could do to make the training run without hitting OOM?
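One way to narrow this down (a hypothetical diagnostic, not part of the repo) is to log peak VRAM per step and check whether the OOM coincides with the point where gradients start flowing through WavLM:

```python
import torch

def log_peak_memory(step, device=None):
    """Print peak allocated VRAM since the last reset, then reset the counter."""
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    print(f"step {step}: peak {peak_gb:.2f} GB allocated")
    torch.cuda.reset_peak_memory_stats(device)

# Hypothetical usage inside the training loop:
# for step, batch in enumerate(loader):
#     ...forward / backward / optimizer step...
#     log_peak_memory(step)
```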